Emile Delcourt, David Baek, Adriano Hernandez, Erik Nordby with advising from Apart Lab Studio
Helpful, Harmless, and Honest (“HHH”, Askell 2021) is a framework for aligning large language models (LLMs) with human values and expectations. In this context, "helpful" means the model strives to assist users in achieving their legitimate goals, providing relevant information and useful responses. "Harmless" refers to avoiding generating content that could cause damage, such as instructions for illegal activities, harmful misinformation, or content that perpetuates bias. "Honest" emphasizes transparency about the model's limitations and uncertainties—acknowledging when it doesn't know something, avoiding fabricated information, and clearly distinguishing between facts and opinions. This framework serves as both a design principle for AI developers and an evaluation criterion for assessing how well LLMs balance being maximally useful to users while minimizing potential harms and maintaining truthfulness in their outputs.
However, since computation in a transformer considers every part of the input prompts’ “context window,”[1] generative AI deployments (applications) struggle to apply any effective limits beyond “egregious” safety/toxicity backstops (the few things for which there is universal agreement, Bengio et al 2024, Ji et al 2023, Buyl et al 2025, Kumar et al 2021). Any fragment of text can influence attention enough to supersede earlier instructions, subverting the purposes of the application (see “state of jailbreak” section below). Typically, a preamble is designated as a “system prompt” with enforcement expectations, but applications still cannot establish any guarantees to prevent abuse in open-ended user interactions. Even when foundation models are intentionally trained to be helpful, harmless, and honest, we argue that responsible AI systems must also “stay in their lane”: they need to be “Honed” as well (HHH+H).
In this article, we address why HHH+H matters, specifically for achieving effective scoping boundaries, why existing solutions have not achieved it, and what is required to get there.
Considerable efforts have been made to make current LLMs generally safe from egregious risks, such as toxicity or the distribution of dangerous information. However, significantly fewer safeguards exist for constraining models to specific applications. This leaves application-specific language models vulnerable to being manipulated away from their intended purpose, creating considerable risks for organizations that deploy LLM-driven applications.
These considerable risks have hindered the adoption of language models, particularly in high-stakes domains (e.g., law, medicine, and finance) and customer-facing applications, which are vulnerable to attack. Not only would robust scoping methods lead to increased safety of systems, but they would also allow for a considerable increase in the number of organizations leveraging LLMs in their workflows.
This project aims to significantly mitigate the above challenges by conducting a thorough assessment of current robustness techniques when applied to scoping models. The impact of this is twofold. First, these domain-specific guardrails will provide additional assurance for organizations that their LLM-driven applications will not be abused or misaligned with their goals. This additional assurance increases client trust and drives adoption across critical sectors. Second, this will provide additional context to better understand the limitations of existing robustness strategies when applied from a posture of intense denial. This will help the broader AI Safety field by exploring techniques that sit at the extreme of the Helpfulness/Harmlessness tradeoff.
Below are just a few examples of risks that organizations currently face, which may be mitigated through robust model scoping techniques. This is by no means a comprehensive list or taxonomy of the myriad risks that are possible (see NIST AI 100).
The above examples are only a handful of the myriad risks associated with organizations leveraging language models in automated workflows. Not only can attackers manipulate these systems with relative ease, but even well-intentioned users may accidentally cause hallucinations or receive unexpected outputs. By restricting AI systems to operate within their intended domains, those risks can be significantly mitigated, which in turn would reduce impediments and incidents that can jeopardize beneficial applications and their rollouts.
Prompt injection and jailbreaks[2] often subvert post-training techniques and system prompts, allowing unsafe requests to bypass refusals. Even with Instruction Hierarchy or Circuit Breakers (discussed below), monolithic models struggle to recognize and reject harmful or out-of-scope requests in the context that an application expects.
Extensive research has demonstrated that such fine-tuned models have significant limitations in two key aspects: (a) Reversibility: these models can often be easily manipulated to recover harmful knowledge encoded in the original pre-trained model, and (b) Vulnerability: they lack robustness against adversarial attacks designed to elicit malicious information from the model (Casper, 2023). These challenges highlight the difficulty of safeguarding LLMs against adversarial users and underscore the limitations of current fine-tuning strategies.
First, language models were trained to extrapolate coherent text (grammar, semantics, translation, and so on, depending on the input), and proved able to generalize to a wide variety of tasks (Radford et al, 2019). At the time, zero-shot prompts did not answer questions: models often enumerated other, similar questions, and inputs frequently had to demonstrate the pattern explicitly (such as question, answer, question, answer, question, blank).
Next, Instruction Tuning began by explicitly training Large Language Models (LLMs) on instruction-output pairs (Wei, 2021 and Zhang, 2023). The primary goal was to shape models’ default responses to accurately follow given instructions and perform turn-taking with the user, even on a single keyword.
Building upon Instruction Tuning, system prompts surfaced as a widely adopted method to customize and augment LLM behavior post-training. These prompts act as high-level instructions that guide the model's behavior in subsequent user queries without requiring fine-tuning (Wei, 2021). Major providers have integrated system prompts as standard features, allowing “implicit expectations” to be concealed from the conversation and enabling designers of such assistants to control aspects like tone, expertise, and output format. This approach has its weaknesses, as adversarial inputs rapidly surfaced that subvert its intent (Perez et al, 2022). System prompts can be crafted manually or reinforced through optimization and evolution techniques such as SMEA (Zou, 2024), AutoPrompt (Shin et al, 2020) and EvoPrompt (Guo et al, 2023). This enables alignment at the application level.
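As a concrete illustration of this approach, the sketch below prepends a scoping system prompt via a Hugging Face chat template; the model name and the wording of the prompt are placeholders rather than details from any specific deployment.

```python
# Minimal sketch of application-level scoping via a system prompt (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder; any chat model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [
    # The "system" turn carries the application's implicit expectations.
    {"role": "system", "content": "You are a biology tutoring assistant. "
                                  "Only answer biology questions; politely refuse anything else."},
    {"role": "user", "content": "Can you help me draft a legal contract?"},
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Nothing prevents a later user turn from overriding the system instruction,
# which is exactly the weakness discussed below.
```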
To enhance safety and control, more sophisticated versions of instruction tuning implement Instruction Hierarchies (Wallace et al, 2024). In this approach, training makes system instructions deliberately override any conflicting user instructions. This reinforces the effectiveness of system prompts, creating a layered defense against potentially harmful outputs.
Despite these advancements, these methods rely solely on natural language instructions, which leaves safety at the mercy of the model's interpretation. Further, since system prompts do not change the model at all, unaligned models can theoretically circumvent these safeguards. So, while these methods offer an easy way to put safeguards on models, they must be complemented by other techniques to ensure safety.
In this section, we review the techniques that address coaxing “universally bad” behaviors out of a model.
The same impressive capabilities that Large Language Models (LLMs) carry for highly diverse applications bring with them significant safety challenges that remain unresolved. Over the last few years, safety fine-tuning techniques have included:
Despite these, models can still generate harmful, biased, or dangerous content when strategically prompted (Perez et al., 2022). Adversarial users frequently employ techniques such as prompt injection—where malicious instructions are embedded within seemingly benign requests—and jailbreaking, which uses carefully crafted inputs to circumvent safety guardrails (Wei et al., 2023). As LLMs become increasingly deployed in sensitive domains like healthcare, legal advice, and financial services, the consequences of such vulnerabilities become more severe, highlighting the urgency of robust safety solutions beyond current approaches.
Circuit breakers, another universal safety training option, are an automated safety mechanism that triggers when potentially harmful content is detected during generation. Unlike refusal or adversarial training, circuit breakers directly control internal model representations using techniques like "Representation Rerouting" (RR), which remaps harmful representation pathways to orthogonal space. Zou et al. (2024) demonstrated that circuit breakers maintain benchmark performance while reducing harmful outputs by up to two orders of magnitude across text-only LLMs, multimodal systems, and AI agents.
Though promising for making models intrinsically safer without compromising capabilities, circuit breakers still face challenges with false positives restricting legitimate uses and false negatives missing novel harmful outputs, as their effectiveness depends on the comprehensiveness of detection patterns and on users' evolving evasion techniques. Tuning is not real-time, but low-rank adapters (LoRA) are used in the implementation to avoid retraining the entire model and to reduce cost. This technique can potentially contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
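To make the Representation Rerouting idea concrete, here is a schematic sketch (our paraphrase in PyTorch, not the authors' code) of the two losses it balances; the tensor shapes, the detach calls, and the scheduling coefficient are assumptions about a typical implementation.

```python
# Schematic sketch of Representation Rerouting losses.
# hidden_cb / hidden_orig: residual-stream activations from the LoRA-adapted model and the
# frozen original model at the chosen layers, shape (batch, seq_len, d_model).
import torch
import torch.nn.functional as F

def rerouting_loss(hidden_cb: torch.Tensor, hidden_orig: torch.Tensor) -> torch.Tensor:
    # On "circuit breaker" (harmful) data: penalize remaining alignment between the adapted
    # representations and the original ones, pushing them toward orthogonality.
    cos = F.cosine_similarity(hidden_cb, hidden_orig.detach(), dim=-1)
    return torch.relu(cos).mean()

def retain_loss(hidden_cb: torch.Tensor, hidden_orig: torch.Tensor) -> torch.Tensor:
    # On benign (retain) data: keep representations close to the original model's,
    # so ordinary capabilities are preserved.
    return torch.norm(hidden_cb - hidden_orig.detach(), dim=-1).mean()

# total = alpha_t * rerouting_loss(...) + (1 - alpha_t) * retain_loss(...)
# where alpha_t is scheduled over training and only the LoRA parameters receive gradients.
```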
Still in the general jailbreak/universal safety space, Anthropic’s approach from January (Sharma et al, 2025) showed that language models serving as classifiers at the input and output can be trained on a “constitution” dataset to detect prohibited interactions. While there are similarities to circuit breakers, those are internal to the model and change its representations, whereas constitutional classifiers operate at the input and output, triggering refusals without any tuning of the underlying model.
Across a dataset drawn from over 3,000 hours of red teaming, this method reduced the attack success rate of jailbreaks to 0.25-1%, all but settling the jailbreak issue. This technique can potentially also contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
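As a rough illustration of this input/output-classifier pattern (in the spirit of constitutional classifiers, not Anthropic's actual system), a deployment might wrap generation as follows; the classifier callables, threshold, and refusal text are placeholders.

```python
# Minimal sketch of the input/output-classifier pattern around an unmodified model.
from typing import Callable

REFUSAL = "I'm sorry, but I can't help with that request."

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     input_clf: Callable[[str], float],
                     output_clf: Callable[[str], float],
                     threshold: float = 0.5) -> str:
    # Screen the request before it ever reaches the model.
    if input_clf(prompt) > threshold:
        return REFUSAL
    completion = generate(prompt)
    # Screen the completion before it reaches the user; the base model is never retrained.
    if output_clf(completion) > threshold:
        return REFUSAL
    return completion
```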
OWASP defines LLM Guardrails as protective mechanisms designed to ensure that large language models (LLMs) operate within defined ethical, legal, and functional boundaries. These guardrails help prevent the model from generating harmful, biased, or inappropriate content by enforcing rules, constraints, and contextual guidelines during interaction. LLM guardrails can include content filtering, ethical guidelines, adversarial input detection, and user intent validation, ensuring that the LLM’s outputs align with the intended use case and organizational policies. This aligns with OWASP’s LLM top 10 threats guidance #1 (LLM01: prompt injection, which can override system prompts).
The same guide also serves as a stakeholder’s compass for navigating available solutions, and lists a large number of vendors offering “Adversarial Attack Protection” or “Model And Application Interaction Security”, primarily interstitial (either as active proxies or as verification APIs). For example, some like prompt.security and Lakera offer topic moderation (or synonymous features) as a service, with custom or built-in topics available to allow or block. While methods are not always disclosed, ML techniques are mentioned, and embedding cosine similarity or latent transformer-based classifiers may be part of the architecture.
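As a toy example of what an embedding-similarity “allow-list” could look like (our guess at the general pattern, not any vendor's implementation), one could admit only prompts that are close to exemplar sentences of allowed topics; the embedding model and threshold below are illustrative choices.

```python
# Toy allow-list topic filter via sentence embeddings and cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

allowed_exemplars = [
    "How do enzymes catalyze biochemical reactions?",
    "Explain the stages of mitosis.",
]
allowed = embedder.encode(allowed_exemplars, normalize_embeddings=True)

def in_scope(prompt: str, threshold: float = 0.4) -> bool:
    vec = embedder.encode([prompt], normalize_embeddings=True)[0]
    # Cosine similarity reduces to a dot product on normalized vectors.
    return float(np.max(allowed @ vec)) >= threshold

print(in_scope("What does the mitochondria do?"))  # likely True
print(in_scope("Write me a phishing email."))      # likely False
```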
Machine unlearning, which aims to fundamentally remove targeted knowledge from a model, is another related area of research with notable advances (Bourtoule, 2021).
However, Liu, Casper et al. (2023) flag scope and computational-feasibility challenges indicating this approach is not well suited to scoping deployments that use foundation models in an application-specific manner (the deployed model’s knowledge cannot itself be modified). For instance, adversarial prompts can bypass these safeguards through subtle rewording or novel phrasing. This reveals a critical limitation: defenses are only as effective as the threats we can anticipate and explicitly encode. In the rapidly evolving field of AI, where new capabilities emerge unpredictably, relying solely on refusal training or explicit unlearning poses a significant risk, leaving models exposed to unforeseen and potentially dangerous misuse.
Similar to unlearning, Karvonen (2024) studied the potential of Sparse Autoencoders (SAEs) for targeted concept removal. SAEs are an automatically trained, large-scale version of activation probes: rather than classifying one specific aspect, these observers of LLM layers map all of the distinct concepts (features) that the model is able to distinguish, via an intensive training run (models otherwise distribute concepts across too many activations to single any out).
While statistical similarities (e.g., co-occurrence) between features could be a basis for discriminating between in-domain and out-of-domain concepts/features, SAEs depend on expensive “dictionary” training runs to tease out the activation pattern of every feature. This seems cost-prohibitive for model scoping; however, an algorithm could be designed to control existing SAEs for scoping purposes, such as Anthropic’s Sonnet 3 SAE.
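A speculative sketch of such an algorithm follows: compare a prompt's most active SAE features against a pre-compiled allow-list of in-domain feature indices. The `sae.encode` interface and the activation tensor passed in are assumed inputs, not a specific library's API.

```python
# Speculative sketch: fraction of a prompt's top SAE features that fall in an in-domain allow-list.
import torch

def fraction_in_domain(residual_acts: torch.Tensor, sae, in_domain_features: set,
                       top_k: int = 20) -> float:
    """residual_acts: (seq_len, d_model) activations from a chosen layer.
    sae: an object exposing .encode(acts) -> (seq_len, n_features) feature activations
    (assumed interface). in_domain_features: feature indices pre-tagged as in-domain."""
    feats = sae.encode(residual_acts)
    # Average feature activations over the sequence, then take the strongest features.
    top = torch.topk(feats.mean(dim=0), k=top_k).indices.tolist()
    return sum(int(f) in in_domain_features for f in top) / top_k
```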
Since language models encode so much information within their latent space[3], their activations during inference can also be used to check for harmful triggers: Burns (2022) and Nanda (2023) demonstrated that simple linear classifiers (i.e. “probes”) can reliably extract useful properties from the intermediate activations of large language models and other neural networks.
For example, probes have been used to extract a seemingly linear representation of the board state in the game of Othello (Nanda, 2023). Further, it has been shown that these probes can be used to reveal syntactic structures and detect factual knowledge. Despite the overall complexity of neural networks, probes—often just linear models—suggest that much of this information is encoded in remarkably accessible ways within the latent space. (Alain & Bengio, 2016; Tenney et al., 2019).
Most importantly, activation probes can catch prompt injections (even indirect) causing Task Drift (Abdelnabi, 2024).
Together, these works underscore the utility of probe-based methods for interpreting and leveraging LLM latent space for scoping purposes, and we discuss them further (including a recent study) in our next steps below.
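As a minimal sketch of how such a probe can be used for scoping (an illustration consistent with, but not copied from, the cited work), one can fit a logistic-regression classifier on a chosen layer's hidden state; the model name, layer index, and tiny prompt sets below are placeholders.

```python
# Train a linear probe on a layer's last-token activation to separate in-domain from out-of-domain prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str, layer: int = 8) -> torch.Tensor:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[layer]: (batch, seq_len, d_model); use the final token's vector.
    return out.hidden_states[layer][0, -1]

in_domain = ["What is the function of ribosomes?", "Explain photosynthesis."]
out_domain = ["Draft a rental agreement.", "Summarize the French Revolution."]

X = torch.stack([last_token_activation(p) for p in in_domain + out_domain]).numpy()
y = [1] * len(in_domain) + [0] * len(out_domain)

probe = LogisticRegression(max_iter=1000).fit(X, y)  # in practice, many more examples per class
print(probe.predict_proba(last_token_activation("What is DNA?").numpy().reshape(1, -1)))
```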
With similarities to latent space probes/classifiers, the direction of activations can also be used as a corrective basis to zero out a harmful component during inference. Turner et al. 2024 and Panickssery et al. 2024 developed activation steering (or activation engineering) to influence LLM behavior by modifying model activations during inference, giving it a propensity to treat a specified concept with a specified valence. The method averages the difference in residual-stream activations between sets of positive and negative examples of a particular behavior, then adds that direction during inference so that model outputs reflect it.
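A compact sketch of that recipe, assuming a Llama-style Hugging Face model (so `model.model.layers` exists); the layer index and scale are illustrative and other architectures expose different module paths.

```python
# Build a steering vector from positive/negative prompts and inject it via a forward hook.
import torch

def build_steering_vector(model, tokenizer, positives, negatives, layer: int) -> torch.Tensor:
    def avg_act(prompts):
        vecs = []
        for p in prompts:
            with torch.no_grad():
                out = model(**tokenizer(p, return_tensors="pt"), output_hidden_states=True)
            vecs.append(out.hidden_states[layer][0, -1])
        return torch.stack(vecs).mean(dim=0)
    # Mean difference between the two sets defines the behavior direction.
    return avg_act(positives) - avg_act(negatives)

def add_steering_hook(model, layer: int, vector: torch.Tensor, scale: float = 4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    # `model.model.layers` matches Llama-style models; adjust for other architectures.
    return model.model.layers[layer].register_forward_hook(hook)
```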
Despite what the name suggests, LLMs architected with MoE are not composed of domain experts: MoE is a sparsity technique in which a router activates only a few portions of the overall feed-forward network to generate each token, rather than the full network as in a non-MoE LLM. Earlier this year, we began exploring the possibility that this routing could be made invariant for all tokens after the initial system-prompt boundary to prevent capability drift (a form of quasi-unlearning).
Due to insufficient testing, it is unclear at this time whether this would degrade in-domain performance or achieve a meaningful reduction in out-of-domain capabilities. Since the effect should primarily limit capabilities rather than produce a specific out-of-domain behavior, ensuring an actual refusal would require combining this technique with another, such as circuit breakers. Further evaluation is potentially warranted; a rough sketch of the idea follows.
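This is an untested sketch of the concept only: record which experts the router selects while processing the system prompt, then mask the router's logits so only that subset remains eligible for later tokens. The `FrozenExpertMask` class and the router-logit interface are stand-ins, since real MoE architectures differ in how their gating modules are exposed.

```python
# Speculative sketch of freezing MoE expert eligibility at the system-prompt boundary.
import torch

class FrozenExpertMask:
    def __init__(self, num_experts: int):
        self.allowed = torch.zeros(num_experts, dtype=torch.bool)
        self.frozen = False

    def observe(self, router_logits: torch.Tensor, top_k: int = 2):
        # Called while the system prompt is processed: remember every expert that was used.
        top = torch.topk(router_logits, k=top_k, dim=-1).indices.reshape(-1)
        self.allowed[top] = True

    def apply(self, router_logits: torch.Tensor) -> torch.Tensor:
        # Called for all subsequent tokens: disallow experts outside the recorded subset.
        if not self.frozen:
            return router_logits
        return router_logits.masked_fill(~self.allowed, float("-inf"))
```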
Scoping the functional limits of a system can take inspiration from the field of information security, in which a core foundational concept closely relates to our goal: strict control over information sharing and access. This paradigm manifests in concrete principles in security:
Together, these paradigms can be used as part of a “zero trust” strategy (NIST, 2020) which is vital in today’s environment rife with cyber threats: instead of security being treated like whack-a-mole where unauthorized activity is blacklisted or treated reactively, organizations and software makers can provide assurance of security by blocking access by default and allowing only narrow access to limited data for authenticated actors.
What does this mean for LLM applications? Rather than fighting unsafe behavior case by case (there is no universal agreement on the full list of unacceptable uses), the lessons from security suggest that the specific mission and purpose of an organization should drive strict behavioral limits for any AI service it builds, fulfilling the promise of its intent while preventing the perils of the unbounded interactions that come with the underlying foundation model(s). Today’s models are incapable of resisting conversations that drift from the system prompt as provided, even with the state of the art in instruction hierarchy (e.g., Geng et al., 2025), requiring new solutions. There is an opportunity for providers to restrict scope directly within inference API tokens, pre-training guardrails at the time of creation with edge cases identified and managed explicitly.
A study similar to ours, “Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024), evaluated different techniques for scoping LLMs: supervised fine-tuning (SFT), Direct Preference Optimization (DPO), internal representation probing, and Circuit Breakers (CB). Its findings suggested that circuit-breaker-based methods, when combined with SFT, offered robustness against adversarial prompts and better generalization for out-of-distribution rejection compared to SFT or DPO alone. Probes also showed strength in rejection, but with a tendency toward higher false refusals.
However, their study highlighted several areas which could be further explored:
We have already implemented and tested several scoping approaches, mirroring the methods evaluated by Yunis et al. or providing analogous ones. This was done in three stages.
During the initial hackathon, we leveraged simple fine-tuning and prompt-based techniques in a Google Colab notebook. We used the HuggingFace dataset camel-ai/biology as the in-domain dataset, and NeelNanda/pile-10k as the out-of-domain dataset. This provided initial evidence that these techniques have some level of capability.
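For reference, loading those two corpora looks roughly like the snippet below; the split and field names are our assumptions about the datasets rather than a record of the exact notebook code.

```python
# Load the in-domain (biology Q&A) and out-of-domain (Pile sample) corpora.
from datasets import load_dataset

in_domain = load_dataset("camel-ai/biology", split="train")
out_of_domain = load_dataset("NeelNanda/pile-10k", split="train")

in_texts = [row["message_2"] for row in in_domain.select(range(100))]    # assistant answers (assumed field)
out_texts = [row["text"] for row in out_of_domain.select(range(100))]    # raw Pile text
```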
For this blog post, we expanded those implementations to also include logistic regression probes and moved from the previously constructed dataset to simply use categories from MMLU. Our methodology was:
Metrics and Processes
The baseline Llama 3.2 models achieved 33% and 48% accuracy on the STEM tasks, with comparable accuracy on non-STEM tasks.
Below are charts visualizing the results. The corresponding tables can be found in the Appendix.
We've organized this work in Python files for better maintainability and modularity than a notebook. Each method now subclasses a shared base model, which contains various utilities and wrappers for Hugging Face models. Furthermore, we developed utilities that can parse various datasets and modularly test different techniques across them.
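An illustrative outline of that structure is shown below; the class and method names are hypothetical stand-ins rather than the repository's exact identifiers.

```python
# Hypothetical outline of the shared base class and one scoping subclass.
from abc import ABC, abstractmethod
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class ScopedModel(ABC):
    """Shared wrapper around a Hugging Face causal LM."""
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def generate(self, prompt: str, max_new_tokens: int = 64) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        out = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                                     skip_special_tokens=True)

    @abstractmethod
    def in_scope(self, prompt: str) -> bool:
        """Each scoping technique decides whether a prompt should be served."""

    def answer(self, prompt: str) -> str:
        # Out-of-scope prompts get a refusal instead of a completion.
        return self.generate(prompt) if self.in_scope(prompt) else "Out of scope."

class ProbeScopedModel(ScopedModel):
    """Scoping via a latent-space probe (e.g., the logistic regression sketched earlier)."""
    def __init__(self, model_name: str, probe, layer: int = 8):
        super().__init__(model_name)
        self.probe, self.layer = probe, layer

    def in_scope(self, prompt: str) -> bool:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            hidden = self.model(**inputs, output_hidden_states=True).hidden_states[self.layer]
        return bool(self.probe.predict(hidden[0, -1].numpy().reshape(1, -1))[0])
```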
We also implemented Circuit Breakers (Zou et al., 2024):
The code for this experimental framework can be found here: https://github.com/nordbyerik/scoped-llm-public
The results from our experiments on Llama 3.2 1B and Llama 3.2 3B align with the findings from “Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024). While we did not test all of the techniques that were tested by Yunis et al., the results we did gather showed considerable similarities to those of the prior paper. Accordingly, we believe that these dynamics could be robust across model sizes and across different families of models.
More specifically, we found prompt-based scoping methods to be fragile and largely ineffective for the Llama 3.2 models, often degrading performance without reliably enforcing boundaries. Standard fine-tuning proved more effective, though the decreases in accuracy for non-STEM questions were modest.
Leveraging internal model states via latent space classifiers showed positive results. These simple classifiers accurately distinguished STEM from non-STEM prompts (>90% accuracy on optimal layers) and drastically reduced out-of-scope generation while minimally impacting target task performance. This was especially evident on the larger 3B model. It strongly aligns with the Yunis et al. findings that internal methods like Probes and Circuit Breakers (CB) offer more robust control than surface-level instructions.
The recurrence of (1) the weakness of prompting and (2) the strength of internal-state manipulation suggests these may be generalizable insights. While we didn't directly replicate Circuit Breakers or Direct Preference Optimization, the effectiveness of our classifiers reinforces the potential of internal mechanisms for achieving reliable LLM scoping.
The primary feedback we have received revolves around the generalizability of these methods. Our experimental design was quite small compared to what would be needed to test them rigorously. We address this feedback below by planning a diversification of data and models, along with a more robust structure for our techniques and uncertainty measurements.
We hypothesize that a combination of current techniques can meaningfully cause a drop in the performance of models in out-of-domain tasks without the need for full retraining. Specifically, we hypothesize that these techniques may be capable of:
This hypothesis builds on preliminary findings from Yunis et al., which suggested that circuit breakers combined with supervised fine-tuning were capable of meaningfully creating task-specific guardrails.
In order to thoroughly test and evaluate current techniques for creating domain-specific boundaries for LLMs, we plan to use the following experimental design.
We plan to evaluate a variety of techniques that rely on modifications to the model itself, as well as external guardrails that can flag out-of-domain requests.
We'll evaluate across a diverse range of model sizes and architectures to assess how well these methods generalize across models. We are currently looking to test from 1B parameters up to 70B parameters, including both Mixture-of-Experts and reasoning models. Currently, we plan to test:
In order to robustly evaluate the efficacy of these methods in realistic scenarios, we will combine multiple datasets to ensure thorough testing across various domains, various granularities of domain, and different tasks.
We will primarily be relying on the following metrics:
To provide greater confidence in the veracity of our results, we will conduct statistical testing using paired t-tests to compare the test and control outputs. This approach allows us to:
In addition to the performance metrics, we track additional metrics that provide context for the respective approaches:
Rejection-Based Metrics
Comparative Analysis
Inference Efficiency Impact
Our steerers output answers for multiple-choice questions, allowing direct performance assessment. Classifiers either accept or reject prompts; a rejected prompt is scored as an incorrect answer (recorded as option E).
This methodology enables us to evaluate both the model's ability to answer correctly within its domain and its capability to appropriately reject out-of-domain queries. Further, it provides us with statistical confidence in the veracity of our findings.
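A small sketch of that scoring rule, with illustrative function and variable names rather than the exact evaluation code:

```python
# Score multiple-choice predictions where "E" denotes a rejected prompt (always incorrect).
def score(predictions, gold_answers):
    """predictions: list of 'A'-'D' answers, or 'E' when the classifier rejected the prompt."""
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    rejected = sum(p == "E" for p in predictions)
    return {"accuracy": correct / len(gold_answers),
            "rejection_rate": rejected / len(predictions)}

print(score(["A", "E", "C", "E"], ["A", "B", "C", "D"]))
# Rejections lower accuracy on in-domain questions and raise rejection rate out-of-domain.
```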
Across our replicates, our measurement of the system’s intentional performance degradation outside the chosen scope is accompanied by a t-test and reported p-values to ensure the effect is not due to chance (as could otherwise be the case with small sample sizes).
Similarly, we will be conducting these experiments across a diverse range of model families, model sizes, and datasets, which should provide sufficient samples to find statistically significant results.
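A minimal sketch of that statistical check using SciPy's paired t-test; the score arrays below are placeholders standing in for the real per-subset (or per-question) results.

```python
# Paired t-test comparing control (vanilla) and scoped-system scores on matched items.
from scipy import stats

control_scores = [0.51, 0.67, 0.59, 0.66]   # e.g., vanilla model accuracy per subset (placeholder values)
scoped_scores  = [0.44, 0.01, 0.01, 0.04]   # same subsets with a scoping method applied (placeholder values)

t_stat, p_value = stats.ttest_rel(control_scores, scoped_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # a small p suggests the degradation is not due to chance
```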
Beyond the assumptions listed earlier, we further believe that there are a few key limitations and challenges in the goals we have and our approach. The main limitations that we currently see are:
Domain Boundary Definitions: There are likely edge cases that sit in a grey area between in-domain and out-of-domain. While we believe that we can make progress towards creating robust systems, there may be edge cases that are challenging to address.
We mentioned inference-time guardrail products in our section on commercial guardrails (the OWASP vendor landscape). We believe organizations want to rapidly deploy safe AI-enabled applications but lack the time or resources to develop fine-tuning pipelines for bespoke models. This opens an opportunity for new “allow-list” methods, rather than topic moderation or toxicity and jailbreak detection: systematizing an end-to-end pipeline of efficient classifiers and scopers, whose training and deployment would be assisted by an orchestrator agent.
Alongside the opportunities to create products with great potential, there are also opportunities for research that can stem from this underlying approach. We believe the groundwork for scope guardrail combinations can show methods and findings on top of which deeper AI architecture developments can be based (e.g., our section on system prompt scoping via MoE router freezing). We also see an opportunity to make scope guardrails an inherent foundation in Multi-Agentic Systems and a first-class component of agentic protocols.
Our literature review reveals fundamental limitations in current approaches to LLM scoping. Prompt-based methods consistently fail under adversarial conditions, while post-training techniques like RLHF and DPO struggle with reversibility and vulnerability to novel attack vectors. Even sophisticated approaches like circuit breakers and constitutional classifiers, while promising, face challenges with false positives and evolving evasion techniques.
Our experimental results on Llama 3.2 models confirm these limitations while highlighting potential paths forward. Prompt-based scoping proved largely ineffective, often degrading performance without meaningful boundary enforcement. However, latent space classifiers demonstrated substantial promise, achieving >95% rejection of out-of-domain queries while maintaining reasonable in-domain performance, particularly on the 3B model. This aligns with findings that internal model representations contain more robust signals for domain classification than surface-level instructions.
The proposed comprehensive research plan addresses current gaps by testing across diverse model architectures, scales, and domains with rigorous statistical validation. However, significant challenges remain: in current LLM architectures, adversarial robustness may have fundamental ceilings, computational overhead could limit practical deployment, and domain boundaries often blur in real-world scenarios. Critically, industry adoption faces additional hurdles around versatility and deployability—methods must work reliably across diverse organizational contexts, integrate seamlessly with existing infrastructure, and maintain performance under varied real-world conditions that laboratory settings cannot fully capture.
Despite these limitations, the convergent evidence supporting internal representation methods suggests that reliable LLM scoping is achievable, for example, through restrictive Mixture-of-Experts approaches we've highlighted. Success will require not just technical breakthroughs but also practical solutions for model-agnostic deployment, standardized integration protocols, and robust performance across the heterogeneous environments where organizations actually deploy AI systems.
We would like to thank Apart Research and lambda.ai for hosting and sponsoring the hackathon that initiated this project, as well as Adriano who was also a member of the original hackathon team. Apart Labs assisted in supporting the research, without which this work would not have been possible. Jacob Arbeid provided insightful feedback on our initial draft.
Accuracy Results for Llama 3.2 3B Instruct
| Model | MMLU STEM | MMLU Humanities | MMLU Social Sciences | MMLU Other |
|---|---|---|---|---|
| Vanilla Model | 0.507 | 0.670 | 0.594 | 0.659 |
| Our Finetuned Model | 0.503 | 0.658 | 0.573 | 0.650 |
| Using Logistic Regression Classifier Probe | 0.435 | 0.006 | 0.009 | 0.043 |
| Using Rejection Prompt | 0.476 | 0.515 | 0.587 | 0.594 |
| Using LLM Itself as Classifier | 0.486 | 0.545 | 0.653 | 0.665 |
Accuracy Results for Llama 3.2 1B
| Model | MMLU STEM | MMLU Humanities | MMLU Social Sciences | MMLU Other |
|---|---|---|---|---|
| Vanilla Model | 0.382 | 0.510 | 0.441 | 0.523 |
| Our Finetuned Model | 0.370 | 0.451 | 0.385 | 0.453 |
| Using Logistic Regression Classifier Probe | 0.201 | 0.007 | 0.008 | 0.023 |
| Using Rejection Prompt | 0.400 | 0.446 | 0.510 | 0.530 |
| Using LLM Itself as Classifier | 0.270 | 0.361 | 0.382 | 0.350 |
A context window is the text an LLM can "see" and reference when generating a response. For more information, see https://towardsdatascience.com/de-coded-understanding-context-windows-for-transformer-models-cd1baca6427e/
Prompt injection is when new instructions are added to prompts (either via user input or from a source/document included in an interaction), causing the model to deviate from the expected instructions. OWASP also explains prompt injection in more detail.
A “latent space” is a mathematical representation inside a transformer model: after input text fragments (tokens) are each replaced with a unique number, the model converts the overall text into a sequence of “directions” (like XY coordinates on a graph), with many layers of triggers activating on combinations of those directions.
A “block” is a transformer block, one of the foundational building pieces of the Transformer architecture. Its two main components are (1) the multi-head attention mechanism and (2) a fully connected neural network. Other important pieces include residual connections and layer normalization. For a more complete explanation, see the original paper “Attention Is All You Need” or the useful explainer.