TL;DR
I was interested in whether LLMs can distinguish input scenarios/stories that carry high vs. low cyber risk, and found that this distinction is one of the “hidden features” present in most later layers of Mistral7B.
I developed and analyzed “linear probes” on hidden activations, and gained confidence that the model generally “senses when something is up” in a given input text, as opposed to low-risk scenarios (F1 > 0.85 for 4 layers; AUC in some layers exceeds 0.96).
The top neurons activating in risky scenarios also have a security-oriented effect on outputs, mostly promoting words (tokens) like “Virus” or “Attack”, as well as words questioning “necessity” or likelihood.
These findings provide some initial evidence that “trust” in LLMs may be reasonable, both for responding conversationally with risk awareness and for building LLM-based risk assessment systems (here, I do not address design/architecture efforts and how they might improve signal/noise tradeoffs).
Neuron activation patterns in most layers of Mistral7B (each layer has 14336 neurons) natively contain the signal needed to correctly pick out the riskier of two very similar scenario texts.
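To make the probing idea concrete, here is a minimal sketch of a linear probe. It is not the code used in this project: it assumes synthetic “activations” (8 dimensions instead of 14336, with a made-up risk signal in three of them) and a dependency-free logistic regression. The names `fake_activation`, `make_pairs`, `train_probe` and all numbers are hypothetical, chosen only for illustration.

```python
import math
import random

random.seed(0)
DIM = 8                                  # toy stand-in for a layer's 14336 neurons
SHIFT = [2.0] * 3 + [0.0] * (DIM - 3)    # hypothetical "risk" signal in 3 dims


def fake_activation(is_risky):
    """Synthetic activation vector: unit Gaussian noise, shifted if risky."""
    return [random.gauss(0.0, 1.0) + (SHIFT[i] if is_risky else 0.0)
            for i in range(DIM)]


def make_pairs(n_pairs):
    """Return n_pairs of (low_risk_vector, high_risk_vector)."""
    return [(fake_activation(False), fake_activation(True)) for _ in range(n_pairs)]


def flatten(pairs):
    """Turn pairs into a flat list X with labels y (0 = low risk, 1 = high risk)."""
    X = [x for pair in pairs for x in pair]
    y = [label for _ in pairs for label in (0, 1)]
    return X, y


def center(X, means):
    """Subtract per-dimension means (computed on the training set)."""
    return [[xi - m for xi, m in zip(x, means)] for x in X]


def score(w, b, x):
    """Probe logit: one linear readout of the activation vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sigmoid(z):
    z = max(-30.0, min(30.0, z))         # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(X, y, epochs=200, lr=0.1):
    """Fit logistic regression with plain batch gradient descent."""
    w, b, n = [0.0] * DIM, 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, label in zip(X, y):
            err = sigmoid(score(w, b, x)) - label
            for i in range(DIM):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Chance that a random risky example outscores a random low-risk one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


X_train, y_train = flatten(make_pairs(200))
X_test, y_test = flatten(make_pairs(100))
means = [sum(col) / len(X_train) for col in zip(*X_train)]
X_train, X_test = center(X_train, means), center(X_test, means)

w, b = train_probe(X_train, y_train)
test_scores = [score(w, b, x) for x in X_test]
probe_f1 = f1_score(y_test, [1 if s > 0 else 0 for s in test_scores])
probe_auc = auc(y_test, test_scores)
# Pairwise check: does the risky member of each pair get the higher score?
lows, highs = test_scores[0::2], test_scores[1::2]
pair_acc = sum(h > l for l, h in zip(lows, highs)) / len(lows)
print(f"F1={probe_f1:.3f}  AUC={probe_auc:.3f}  pairwise={pair_acc:.3f}")
```

With real data, each row of `X` would be a hidden-activation vector captured from one layer of the model for one scenario text, and the logistic regression would more likely come from a library than be hand-rolled as it is here. Training one such probe per layer is what allows metrics like F1 and AUC to be compared across layers.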
Intro & motivation
With the help of the AI Safety Fundamentals / Alignment course, I enjoyed learning about cutting-edge research on the risks of AI large language models (LLMs) and mitigations that can keep their growing capabilities aligned to human needs and safety. For my capstone project, I wanted to connect AI (transformer-based generative models) specifically to cybersecurity for two reasons:
1. Over 12 years of working in security, I've seen interest in “AI” only accelerate, both within security and more generally.
2. Even so, we’re still (rightfully) skeptical of current models’ reliability: LLMs have unique risks and failure modes, including inaccuracy, injection and sycophancy (rolling with whatever the user seems to suggest).
I settled on this “mechanistic interpretability” idea: finding whether, where, and how LLMs were generally se