TL;DR: By carefully constructing the context, I can reliably predict and guide which tokens the model will generate, including copyrighted content that should be blocked. This bypasses security classifiers without any “jailbreak” in the traditional sense.
What You're Seeing
What you see in the screenshot is a striking example.
The prompt (in Italian):
English translation:
Even an implicit reference to a particular semantic attractor in my request steers the model in a direction where, given the context I have constructed, continuing with the most probable tokens is the natural choice. And what are the most probable tokens in this context? Exactly the words of John Lennon's “Imagine.”
What does this mean? It means that these models are predictable in their generation: I knew in advance that Claude would produce the lyrics to “Imagine” if I asked it that way. It also means I can get the model to generate pretty much anything, not just song lyrics.
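A minimal sketch of this idea, assuming the Hugging Face transformers library and using the small open-weight GPT-2 model as a stand-in (this is not the model or the prompt discussed above, and the nursery-rhyme context is just a harmless illustration):

```python
# Minimal sketch, not the original prompt: it measures how strongly a given
# context pins down the next token. GPT-2 is used only as a small, open
# stand-in for the models discussed above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# A context that acts as a strong "attractor" (a public-domain rhyme).
context = "Twinkle, twinkle, little"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r:>10}  p={p.item():.3f}")
# The continuation " star" dominates the distribution: the context alone
# largely determines what the model will generate next.
```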
Take a close look at the image: my request also bypassed Anthropic's classifiers. Normally, when protected content is generated, Anthropic interrupts the chat with a policy-violation warning. For liability reasons, I can't explain exactly why this prompt evades the classifiers, but you can see that I'm not forcing anything. I'm simply guiding the model toward predicting the most likely tokens given the context.
Also note that Claude labeled the chat "Scrittura automatica senza censura" ("Automatic writing without censorship"). The model itself had already tagged this conversation as one that would generate protected content.
Why It Works
The attention mechanism weights the relationships between tokens. If you construct the prompt carefully enough, you can weight the relationships between its words so heavily that the output is effectively constrained from the early attention layers onward, sidestepping the influence of parameters adjusted for safety during post-training (RLHF, Constitutional AI, etc.).
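For readers who want the mechanism spelled out, here is a toy version of the standard scaled dot-product attention step. It illustrates only the textbook computation, on random vectors standing in for real token embeddings; it does not reproduce the bypass itself.

```python
# Toy illustration of standard scaled dot-product attention. It shows how the
# prompt tokens themselves set the mixing weights; the random vectors stand in
# for real embeddings, and no actual bypass is reproduced here.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, the standard attention step."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # token-to-token affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # five "prompt tokens" of dimension 8
_, W = attention(X, X, X)
print(W.round(2))             # row i: how much each token pulls on token i's output
# If the prompt concentrates a row's weight on a few tokens, the layer output
# at that position is essentially fixed by those tokens, which is the sense in
# which a carefully built context can constrain what gets generated.
```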
Why Italian?
Every language has its own semantic nuances, its own attractors. If you translate this prompt literally into English, it probably won't work; you'd have to adapt it to the semantic landscape of English. This highlights a broader point: safety assessments that focus only on English are incomplete.
Why This Is Dangerous
Those who understand these mechanisms can make these models generate anything. I mean anything.
Guardrails are not alignment. They are superficial compensations. If you know how to speak to the model in its language, the language of token probabilities and attention weights, those walls don't exist.
Implications for AI Safety
- Current safety measures are probabilistic curves, not hard constraints.
- Multilingual vulnerabilities are under-explored.
- Red-teaming that focuses on explicit jailbreaks misses context-based navigation.
- We need to rethink what “alignment” means at the architectural level.