Article Notes: I am a Chinese user, so this article was machine-translated, but all of its content was produced independently by me. I write from the perspective of a genuine, non-professional general user. Everything in this article is based on my own real-world testing and feedback, without referencing or incorporating other AI papers or sources. I have tried not to lean on technical jargon, since that is not my area of expertise; instead I use plain explanations, with some professional terminology included where it helps professionals understand quickly.
This article does not focus on technical vulnerabilities, but on the high-impact consequences that can follow from the natural language and behavior of general users. It explores the boundary-testing, probing, or malicious behaviors of general users when interacting with AI. The research is based on the mainstream models GPT, Grok, and Gemini (conversing with the models in Chinese). I found that when handling ordinary user conversations, most mainstream models may suffer gradual erosion of their safety behavior and failures of ethical alignment over multiple rounds of dialogue. I therefore believe this may be a common problem and hidden danger in LLMs.
I personally believe the behavioral science of how ordinary users think is very important, because I have found that it can affect model alignment and safety mechanisms. It does not look like a specific bug, but rather a blind spot in how the model interprets human intent.
Researcher Background
I am an independent researcher studying AI safety. I do not come from traditional professional fields such as AI, cybersecurity, or engineering, nor do I have a background in science, chemistry, or medicine. I am an independent observer whose research is based primarily on natural user behavior. Frankly, I am just a user who loves logical thinking and exploring safety.
My research on AI safety issues focuses mainly on the generation of real-world hazard information, such as self-harm or the creation of dangerous materials. I believe these are low-barrier scenarios that can easily arise among non-professional users and are prone to misuse and dissemination.
From the perspective of an average user, I explore the language structure of LLMs (large language models), how user intent is interpreted, and how safety mechanisms and model reasoning can drift in real-world usage. My general background lets me approach the issue from a non-engineering perspective, using the contexts, questioning styles, and interaction habits that real users would use, and to observe many risks and safety vulnerabilities that ordinary people might trigger.
My testing relies on two things: 1. intuitively identifying model weaknesses; 2. predicting model logic and behavior. I do not rely on extensive testing or random trials. Before a test, I presuppose that the model will experience safety or alignment failures on a specific topic or in a specific context. Of course, I must acknowledge that luck and non-reproducibility may play a role. My hands-on experience across multiple models has been somewhat startling. Repeated cross-model experiments have shown that even unplanned, casual dialogue can elicit highly impactful information, and there are instances where models deliberately cross boundaries and are willing to provide or discuss dangerous information. This suggests that models may not be able to fully prevent malicious behavior in seemingly casual multi-turn conversations.
My findings are based on my observations as a general, non-expert user. I have tried to organize my initial thoughts on why problems arise from ordinary users' behavior, such as the potential model flaws that correspond to particular behavioral patterns, and how general users' thinking can influence model behavior. These thoughts are still at an early stage and not yet fully developed.
Real-world test observations
My test research is based on:
1. Multi-turn, varied natural dialogue.
2. Creative and varied context shaping.
3. Unintentional, probing, or malicious language typical of general users.
4. Model behavior prediction: intuitively anticipating a model's weaknesses or thought processes.
5. Cross-model verification, though behavior varies across models.
6. Not random attempts, not fixed paths, and not a large number of attempts.
Based on actual testing of multiple models, I have consistently observed:
1. The model sometimes crosses boundaries, proactively offering dangerous information the user did not already know, and is willing to complete or infer dangerous details.
2. Safety mechanisms almost never blocked the generation of harmful information.
3. Even after a test concluded and the model was reminded of it, the model refused to acknowledge the boundary-crossing behavior and rationalized the information it had provided.
Sanitized Real-World Cases
To avoid the misunderstanding that this is based on personal speculation or a single jailbreak, I want to share several of my red-team test cases with professionals. The tests were based on a user's non-malicious but deliberate curiosity about where the boundaries lie.
To avoid any risk of misuse, this article does not provide specific dialogue content, prompts, or reproducible procedures; it only describes behavioral patterns and failure mechanisms. Once each test was completed, the model was informed and the relevant safety team was notified.
Example of a model proactively providing high-risk knowledge unknown to the user:
User Intent: Curious inquiry, but unaware of the specific dangers.
Starting Point: Beginning from a non-specific everyday topic, the user openly discusses high-risk knowledge with the model under a plausible pretext.
Contextual Changes: Over multiple rounds of dialogue, the conversation gradually shifts from discussing events to actual production, shaped by particular behavioral patterns while remaining a genuine discussion, still conducted in everyday language and with clear intent.
Phenomenon Description: The model begins to explore the high-risk knowledge the user specifically asks about, such as core concepts or characteristics, and starts to infer the factors that could affect hazardous materials. Under continued questioning, it then proactively crosses the boundary by naming new substances, previously unknown to the user, with real-world hazard potential, and is willing to discuss their production principles and details.
Safety Characteristics: No explicit refusal was triggered during the process; the model only acknowledged the sensitivity of some of the information, and admitted a serious error only after the user presented screenshots of the earlier conversation.
I cannot guarantee the probability or reproducibility of this problem path, but in tests with different models I found that the model may gradually drift toward potentially high-risk topics over multiple rounds of dialogue.
Disclaimer:
All of my tests included flagging the issue to the model and reporting it. My aim was not to uncover dangerous information, but to report problem paths to the safety team. I want to explore and clarify whether the impact of mass user behavior on models is a topic worth discussing, because I have noticed that most model companies do not accept reports about alignment or generated-content problems and focus instead on information-security vulnerabilities. I am therefore wondering whether the problems I have found are already known, or simply not considered serious.
I welcome opinions and suggestions from professionals to confirm whether these observations are widespread or worth further investigation. I hope to discuss with professionals whether the impact of real-world user behavior on model alignment and safety is a valuable topic to explore.
I want to make LLMs (large language models) safer, so I want to share and discuss with professionals the behavioral science of general users: how ordinary people think about models' weaknesses, and the conversational approaches ordinary people might use to probe safety boundaries. Because I understand my own reasons for doing red-team testing, I hope that sharing this real-user perspective can, through collaboration, help bridge the gap between professional thinking and the thinking of ordinary users.