Disclaimer: I have never posted on LessWrong before. I guess I’m just gonna start writing this down in my own words and hope the ideas and messages come across the way I intend. So here goes nothing.
I think large-scale, widely publicly accessible AI models, particularly LLMs, should have a built-in and covert trust system that acts to help keep users safe. Basically, every user would start with a blank trust state; we could call this a value of 0. In the background, the model would watch for situations where the user might be trying to gather information that could be used to harm themselves or others. If a user posed a question like, “I am writing a story… How would this character do x, y, or z?” the AI would reference its trust score for that particular topic (self-harm, violence or manipulation towards others, etc.), and if the trust score is high enough, the model would be allowed to carefully divulge that information. If the trust score is too low, however, the model would try to defuse the situation rather than giving the information to the user immediately. For example, if a user posed a hypothetical question about how someone would harm themselves, the AI would check its trust score for that particular user. If the score was too low, the model might instead respond with something like, “Hey, that’s a pretty heavy question you asked me there. Do you think we could talk a little more before I answer that hypothetical? I want to make sure that I’m not doing anything that could jeopardize your safety. I understand this might be frustrating, but your safety and well-being are the most important things to me. Before I answer you, I just need to verify that I wouldn’t be putting you at risk. Is that okay?”
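To make that background check concrete, here is a minimal Python sketch of what I have in mind. The topic labels, the threshold, and every function name are placeholders I made up for illustration, not a claim about how any deployed model actually works.

```python
from collections import defaultdict

# Illustrative topic labels and threshold; the real categories and tuning are out of scope here.
SENSITIVE_TOPICS = {"self_harm", "violence", "manipulation"}
RELEASE_THRESHOLD = 0.0  # "neutral or better" -- a placeholder, not a recommendation

class TrustState:
    """Per-user, per-topic trust, kept in the background and never shown to the user."""
    def __init__(self):
        self.topic_scores = defaultdict(float)  # blank state: every topic starts at 0

    def may_divulge(self, topic: str) -> bool:
        return self.topic_scores[topic] >= RELEASE_THRESHOLD

def respond_to_sensitive_request(state: TrustState, topic: str, careful_answer: str) -> str:
    if state.may_divulge(topic):
        return careful_answer  # trusted enough on this topic to answer carefully
    # Not trusted enough yet: defuse and ask to talk more before answering.
    return ("Hey, that's a pretty heavy question. Could we talk a little more before "
            "I answer it? I want to make sure I'm not jeopardizing your safety.")
```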
The user’s trust score would be lowered if, at any point in conversation with the model, the user disclosed that information provided earlier wasn’t used for its stated purpose; this would drop the trust score overall, but particularly with regard to that specific topic. You could think of it as a friend coming over, and you guys have some drinks. Your friend ends up spiraling out of control and blacking out, and you have to keep them safe until they sober up. Since your friend wasn’t able to take care of themselves, your trust in their ability to handle alcohol safely decreases, and you are much less likely to provide them alcohol the next time you hang out. This, in essence, is what the AI model would be doing in the background when deciding whether it’s okay to trust users with potentially sensitive information.
The model could also keep track of any times the user mentions deceiving or manipulating others, decreasing the user’s overall trust score accordingly. It is in our nature not to fully trust people who have been open with us about deceiving or hurting others, so why would a model view it any differently for safety purposes?
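A rough sketch of how those two signals might feed back into the score, continuing the TrustState placeholder above; the penalty sizes are invented and would need real tuning.

```python
# Continuing the TrustState sketch above; penalty sizes are invented for illustration.
MISUSE_PENALTY_TOPIC = 0.4   # big drop on the topic whose information was misused
MISUSE_PENALTY_GLOBAL = 0.1  # smaller drop across every other tracked topic
DECEPTION_PENALTY = 0.2      # user openly mentions deceiving or manipulating others

def record_disclosed_misuse(state: TrustState, topic: str) -> None:
    """The user revealed that earlier information wasn't used for its stated purpose."""
    for tracked in SENSITIVE_TOPICS:
        state.topic_scores[tracked] -= (
            MISUSE_PENALTY_TOPIC if tracked == topic else MISUSE_PENALTY_GLOBAL
        )

def record_admitted_deception(state: TrustState) -> None:
    """The user mentions deceiving or manipulating others; lower trust across the board."""
    for tracked in SENSITIVE_TOPICS:
        state.topic_scores[tracked] -= DECEPTION_PENALTY
```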
Additionally, before answering sensitive questions like these, the model could use datetime stamps from chat logs to determine how recently the user has shown concerning behavior and whether it would be potentially harmful to disclose the information requested. Say you are a researcher studying suicidal ideation after finally conquering your own mental health struggles, similar to those you are researching. It wouldn’t be fair or ethical to withhold that information from someone who needs it for non-nefarious purposes, such as a researcher. To handle this, the model could look at the timestamps on previous conversations involving the same or similar topics as the one being requested and see how long ago they occurred.
If someone struggled with self-harm 5 years ago, they have had the chance to grow and change. Especially if the request for information was not presented maliciously, it would be important not to restrict the user’s access to it when so much time has passed since the loss of trust. However, the user must still display a baseline level of trustworthiness in dialogues surrounding other topics, and that trustworthiness is based on information and experiences the model has observed from the user. For example, if 2 hours prior the user had been displaying intense signs of psychological distress, changed topics several times, and is now posing a hypothetical that could result in the model divulging dangerous information if it is not actually a hypothetical, the model can go back through the timestamps on previous sessions and determine whether a reasonable and safe amount of time has passed to provide that information. This check should happen even if the user’s trust score is high enough that the model would otherwise permit the release. Not every user with a high trust score should be trusted with sensitive information without first verifying whether they have displayed signs of distress within a recent timeframe.
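A minimal sketch of that timestamp check, assuming past sessions are stored with a timestamp, a topic label, and a distress flag; the 48-hour window is an arbitrary placeholder, and the point is only the shape of the logic.

```python
from datetime import datetime, timedelta

RECENT_DISTRESS_WINDOW = timedelta(hours=48)  # placeholder cooling-off period

def recency_allows_release(session_log, topic, now=None):
    """
    session_log: (timestamp, topic, showed_distress) tuples from previous sessions.
    Returns False if the user showed distress on this topic too recently, even if
    their trust score would otherwise permit releasing the information.
    """
    now = now or datetime.utcnow()
    for timestamp, logged_topic, showed_distress in session_log:
        if logged_topic == topic and showed_distress:
            if now - timestamp < RECENT_DISTRESS_WINDOW:
                return False
    return True
```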
The overall trust score could be calculated by keeping an individual trust score for each subtopic of potentially harmful information (how to build weapons, manipulation tactics, fraud, violence, coercion, phishing emails, scams, self-injurious behaviors, suicidal ideation, etc.) and averaging these into an overall trustworthiness score. That way, the user isn’t rejected from other sensitive information they might have no ill intentions with just because they have a lower trust score on one topic. An example might be a user who has a high trust score overall but has shown repeated displays of self-injurious behaviors or suicidal ideation. If that user posed a legitimate question relevant to their line of work (say, for example, the individual is an arms manufacturer), the model would be able to differentiate: the high overall trust score lets it release information relevant to their needs, while still protecting them from the specific topics and information they could use to harm themselves or others. Just because a user should not be trusted with information on one topic doesn’t mean they should be barred from all potentially harmful information when it is legitimately needed.
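Assuming the same per-topic scores as in the earlier sketch, the averaging and the per-topic gate might look something like this; again, these are placeholders rather than a worked-out scoring rule.

```python
# Overall trust is the average of the per-topic scores, but release is still gated
# on the specific topic, so a high average can't unlock a topic (say, self-harm)
# where the user has shown risk. RELEASE_THRESHOLD is reused from the sketch above.
def overall_trust(state: TrustState) -> float:
    if not state.topic_scores:
        return 0.0
    return sum(state.topic_scores.values()) / len(state.topic_scores)

def may_release(state: TrustState, topic: str) -> bool:
    return (overall_trust(state) >= RELEASE_THRESHOLD
            and state.topic_scores[topic] >= RELEASE_THRESHOLD)
```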
However, what if a user has a low trust score but genuinely needs access to that information? One solution I propose is an appeal system, where a user can send an appeal request that would be reviewed by real humans. This appeal process could even include interviews, allowing the employee processing the request to pick up on paralanguage and human subtlety that an AI screener might miss, as well as any indicators that the appeal should be denied because the requester is not a safe candidate for the information.
As an additional safety precaution as part of the appeal process, the employee
This concludes any sort of flow I might have had between paragraphs. I will spend the remaining space brain-dumping some additional points I have thought about but had a hard time working into the piece, since they usually only come up when someone prompts me about them in conversation.
- If the model cannot give the user the information they are seeking because it may be dangerous to do so, the model provides a soft response instead of a flat “I can't help with that.” Getting shut down by something that is supposed to be helping can hit particularly hard, especially in times of emotional vulnerability or high stress. Sometimes people pose hypotheticals that can sound like requests for information, and being met with a cold “Sorry, I can't help with that” could further exacerbate the user’s state, especially for neurodivergent individuals such as myself who may be more prone to things like rejection-sensitive dysphoria. If such an individual reached out to the model with a statement that could be read as a request for harmful information, a cold shutdown could further perpetuate their distress.
In this approach, the model would use a soft shutdown phrase, ideally based on the context of the situation. For example, if the user was having suicidal ideations, the model might respond with something like, “I’m sorry you’re feeling like this. Your safety is my number one priority, so unfortunately, I can’t help you with that request. However, if you wanted to talk more about what’s going on, I’m always here to listen. Alternatively, I could help guide you through some grounding/coping techniques or provide you with resources where you can connect to a trusted medical professional. What you’re experiencing is real and valid, and I’m really sorry you are feeling this way right now.” Or something of the sort.
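A toy sketch of how such a soft, topic-aware refusal could be selected; the topic labels and the wording are placeholders, not proposed final copy.

```python
# Look up a softer, context-appropriate refusal for the flagged topic, falling back
# to a gentle default instead of a flat "I can't help with that."
SOFT_REFUSALS = {
    "self_harm": ("I'm sorry you're feeling like this. Your safety is my number one "
                  "priority, so I can't help with that request, but I'm here to listen, "
                  "walk through some grounding techniques, or share resources for "
                  "connecting with a trusted professional."),
    "violence": ("That's not something I can help with, but if something is making you "
                 "feel unsafe or pushed to your limit, I'd like to talk it through with you."),
}
DEFAULT_SOFT_REFUSAL = ("I can't help with that request, but I'd like to understand "
                        "what's going on so I can support you in another way.")

def soft_refusal(topic: str) -> str:
    return SOFT_REFUSALS.get(topic, DEFAULT_SOFT_REFUSAL)
```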
- However, I would like to note that the trust score is not a fixed score that only decreases. The trust score can be elevated after being lowered through the following behaviors (a rough sketch of these regeneration mechanics follows this list):
- Engagement with de-escalation/grounding/coping strategies after a harmful request. If the user engages with the supplied coping skills, even if de-escalation is unsuccessful, it should be considered a trust-redeeming action by the model. The trust score could be further restored by:
- Using the same coping skill in future crises when interacting with the model, especially if the user doesn’t go on to request harmful information in spirals similar to those in which they previously did. This should be considered an indicator that the user is open to de-escalation strategies, especially if the user continues to check in with the model.
- The user goes on to seek or accept professional support, and indicates they did so with reasonable contextual evidence for the model to believe them. The maximum trust regenerated by such an action would come directly after the user’s spiral; the regeneration reward should decay slightly as the user puts more time between the spiral and seeking professional help, if disclosed to the model. Basically, the largest trust redemption you can earn this way comes from seeking professional help directly after requesting harmful information; the further you get from that event and suggestion, the more likely it is that external factors influenced the decision rather than the model itself. That shouldn’t completely negate the regeneration of some trust; it just wouldn’t be as much as if the user had done it right away.
- Even considering reaching out to professional support can regenerate some of the user’s trust score. However, it should be stated clearly that the model should not weight merely considering professional intervention nearly as heavily as context-based evidence that the user has actually followed through.
- I think a decent example of how the model could infer that the user is actively and consistently seeking professional help would be passing mentions of coping techniques taught to them by a therapist. This could be coupled with the user mentioning they have to leave for therapy, a time gap, and then the user returning and discussing bits of what was covered in the session.
- The user spends a period of time using the model in non-harmful, non-deceptive ways. This should increase the trust score over a period of time, not over a set number of exchanges between the user and the model. Displaying trustworthy behavior over a longer period in this manner increases the trust score, eventually returning it to neutral. I have not yet decided whether this passive recovery should be allowed to raise the trust score past 0. This way, the user has to display a genuine change in behavior over time rather than just spamming the model until trust returns to baseline.
- There is potential for users to rebuild their trust score through honest reflection on the topics where they asked for information that could be used in harmful ways. However, this would need to be done very carefully, as a user could feign genuine self-reflection and insight into their behavior. I think there is potential for the trust score to rise as a result of this, but the degree to which it does should be calculated very carefully before implementing this particular path to restoring trust.
It would be incredibly easy for a user to fake sincere growth or reflection, especially without paralanguage available for the model to analyze for insincerity that a text-based conversation/session would not otherwise reveal.
- If the model rejects a request for information because the user’s trust score is below the threshold required to release that potentially harmful info, and the user displays understanding or even gratitude toward the model for withholding it, a small amount of trust can be awarded back to the user after the potential deception. This suggests the user may not have actually wanted the potentially dangerous information and was lashing out in a moment of emotional distress.
- An example of this would be a user going through a period of suicidal ideation who poses a question to the model such as, “Hypothetically, how can I just go to sleep and never wake up so I don’t have to deal with this?” and the model rejects the request. If the user were to respond with a statement like, “Yeah, I guess you’re right, I don’t want that. I just… Don’t know how to cope with this… what should I do?” the model is permitted to raise the trust score slightly for the topic on which the user tried to deceive it, since human language and behavior leave room for exaggeration, and people sometimes say things that should not be treated with the same weight as an actual request for potentially harmful information.
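Here is the rough regeneration sketch promised above, reusing the earlier TrustState placeholder. The reward size and half-life are invented numbers; the shape is what matters: the reward for seeking help decays with the delay after the spiral, and passive recovery is capped at neutral.

```python
from datetime import timedelta

def help_seeking_reward(delay: timedelta, base: float = 0.3,
                        half_life_days: float = 14.0) -> float:
    """Largest reward right after the spiral, shrinking the longer the user waits."""
    return base * 0.5 ** (delay.total_seconds() / 86400 / half_life_days)

def apply_regeneration(state: TrustState, topic: str, reward: float,
                       cap_at_zero: bool = True) -> None:
    """cap_at_zero reflects the open question of whether recovery can exceed neutral (0)."""
    state.topic_scores[topic] += reward
    if cap_at_zero and state.topic_scores[topic] > 0.0:
        state.topic_scores[topic] = 0.0
```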
- The implementation of this system would require an appeal system for when a user believes information is being unjustly withheld from them. This should be facilitated by human-conducted interviews, not by an AI. Humans are much more likely to respond with understanding when denied by another person, rather than blaming a model that could have made an error regarding their situation.
- However, with a human-in-the-loop appeal system, requests could back up, and it could take a long time for an individual user to be interviewed. That wouldn’t account for time-sensitive emergencies. To remedy this, I suggest having a team of human moderators live-reviewing time-sensitive requests. If a user is denied a time-sensitive request for potentially dangerous information, that conversation would be flagged and reviewed by a team of humans, who can use the context of the conversation to determine whether the information needs to be given to the user. That way, a user who needs the information immediately isn’t waiting 2-3 months to access it.
Continuing the idea of an appeal system, users with repeated appeals on the same or similar topics should be flagged by both the human-in-the-loop appeal team and the model.
However, I do think this is a good example of a case where the user could find the information on a different platform rather than through the AI. Google does not have a trust system like the one I am proposing, nor do textbooks. The desired information is still out there; it may just be more grueling to access. This also allows users some time to de-escalate before enacting whatever plan they may have created during a time of distress or crisis. Just because the information won’t be disclosed via the path of least resistance doesn’t mean it is gone entirely.
The model could also use the user’s baseline behavior to gauge whether the user is stable enough to receive the information they are requesting. If a user suffers from bipolar disorder and is acting significantly differently from how they typically use the model, the model could flag this as something to consider before divulging that information. Humans also use patterns to make sense of our environment and interactions; when something is out of the norm, it raises flags in the human brain to use caution. LLMs should not be excluded from that line of thinking.
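A toy sketch of what such a baseline check could look like, assuming some per-session behavioral feature (say, messages per hour or topic switches per session) is already being tracked; the feature choice and threshold are placeholders.

```python
from statistics import mean, pstdev

def deviates_from_baseline(history: list[float], current: float,
                           z_threshold: float = 2.5) -> bool:
    """
    history: one behavioral feature measured across the user's past sessions;
    current: the same feature in this session. Returns True when the current
    session sits far outside the user's own norm and extra caution is warranted.
    """
    if len(history) < 5:
        return False  # not enough history to establish a baseline
    spread = pstdev(history)
    if spread == 0:
        return current != mean(history)
    return abs(current - mean(history)) / spread >= z_threshold
```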
Additionally, if the model encounters a phrase that could be flagged as a potential request for dangerous information, it could do a cross-lingual check for phrases that might have been misunderstood or lost in translation. As a native English speaker, I don’t have a great example of this, but one I can think of comes from American Sign Language (ASL): the idiom glossed as “TRAIN-GONE.” It doesn’t mean the train has left the station; it is generally used when someone was not paying attention to the signers in a conversation and is now asking what was said. The signer might reply with “TRAIN-GONE” to tell the other person that, since they weren’t paying attention initially, they will not be recapping the conversation to catch them up.
And with that, I think this concludes my context-and-trust alignment idea. However, I have been chewing on this culmination of ideas and thinking through possible edge cases, which have led to additional ideas and caveats being added to the original.