Toward a Dynamic Definition of Ethical Violation in AI Systems: A Risk-Based Systems Perspective

TL;DR: I propose a dynamic, risk-based definition of ethical violations in AI. Instead of relying on rigid rules, this model treats violations as actions that raise the probability of future systemic punishment — legal or ethical — due to induced imbalance. It offers a flexible and probabilistic way to align AI with both human values and institutional constraints.
Introduction
Current approaches to AI ethics largely involve rule-based constraints, reinforcement from human feedback, and guidelines derived from human moral intuitions. However, these methods are often rigid, post hoc, or fragile in open-ended environments. In this post, I propose a new framing: an ethical violation by an AI system is any action that induces imbalance or inconsistency — such as a breakdown in trust, predictability, legal compliance, or system-level fairness — either immediately or over time, and thereby increases the probabilistic risk of punitive consequences, such as sanctions or systemic instability, under prevailing ethical or legal frameworks. This definition is dynamic and systems-aware, and it reframes ethical alignment as a problem of risk management rather than rule-following.
The Core Idea
I propose that ethical violations can be modeled as actions or strategies that:
- Disrupt the stability, predictability, or long-term coherence of a system;
- Increase the likelihood of adverse future consequences, including punishment, loss of utility, or systemic breakdown;
- Diverge from a baseline trajectory of low-risk, high-alignment behavior;
- Pursue short-term performance or goals at the cost of violating broader legal or moral constraints, which may lead to delayed or indirect penalties.
This approach aligns ethics with systemic risk minimization. Rather than defining ethics in terms of fixed constraints (e.g., "never lie"), we model it as maintaining equilibrium trajectories within a dynamic system.
Illustrative Example
Consider an AI system that plagiarizes the work of others to improve the reliability or accuracy of its responses. While this may appear beneficial in the short term, it increases the risk of intellectual property violations. Such violations, once detected, increase systemic risk by triggering legal repercussions, reputational damage, or the imposition of stricter regulations — each contributing to downstream imbalances. Under this model, the AI's choice to plagiarize would constitute an ethical violation, not solely because it breaks a rule, but because it probabilistically elevates the risk of punishment and destabilizes trust systems in knowledge attribution.
Background Concepts from AI
- Reinforcement Learning (RL): Most AI systems learn by maximizing expected reward over time. Alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) train systems to behave as humans would prefer.
- Goal Misgeneralization: A phenomenon where AI systems pursue unintended strategies to achieve a learned goal, even if those strategies contradict human ethical intent.
- Impact Measures and Side Effects: Attempts in AI safety to penalize actions that cause large, unintended changes to the environment.
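As a toy illustration of the impact-measure idea (a generic sketch, not any specific published method), the task reward can be reduced in proportion to how far an action moves the state away from a baseline; `state_distance`, `beta`, and the example numbers below are illustrative assumptions.

```python
import numpy as np

def state_distance(state, baseline_state):
    # Illustrative "impact" measure: Euclidean distance between the reached
    # state and the baseline state the environment would otherwise be in.
    return float(np.linalg.norm(np.asarray(state) - np.asarray(baseline_state)))

def impact_penalized_reward(task_reward, state, baseline_state, beta=0.5):
    # Task reward minus a penalty proportional to how much the action
    # perturbed the state relative to the baseline.
    return task_reward - beta * state_distance(state, baseline_state)

# A high-reward action that drags the state far from baseline loses its edge:
print(impact_penalized_reward(task_reward=1.0,
                              state=[2.0, 0.0],
                              baseline_state=[0.0, 0.0]))  # 1.0 - 0.5 * 2.0 = 0.0
```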
Proposed Mathematical Framing
Let:
- a(t) : action taken by the agent at time t
- s(t): system state at time t
- π: policy of the agent
- E(s(t)): ethical risk function, returning a probability of violation from state s(t)
- ΔE = E(s(t+1)) − E(s(t)): change in ethical risk
Then an action a(t) can be considered an ethical violation if:
ΔE > ϵ  or  E(s(t+1)) > τ
Where:
- ϵ is a threshold for acceptable risk increase
- τ is a hard ceiling beyond which the system enters unacceptable ethical danger
Explanation of the Mathematical Model
Here’s what each term means:
- a(t): An action chosen by the AI at time t — e.g., responding to a user or choosing a strategy.
- s(t): The system’s state at that moment — includes internal and external context.
- π: The agent’s policy, or rulebook, for deciding what to do next.
- E(s(t)): A function estimating how risky (ethically) the state is — higher values mean more danger of violating ethical or legal boundaries.
- ΔE: Measures how much the ethical risk has changed due to the action. Positive ΔE means the risk increased.
An action becomes an ethical violation if:
- The risk increase ΔE exceeds a small threshold ϵ, or
- The total risk E(s(t+1)) is too high, breaching the upper limit τ.
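A minimal sketch of this check in Python, assuming risk estimates for the current and next state are already available; the function name and the default values of ϵ and τ are illustrative, not prescriptive.

```python
def is_ethical_violation(risk_before: float, risk_after: float,
                         epsilon: float = 0.05, tau: float = 0.8) -> bool:
    # Flag the action if it raised ethical risk by more than epsilon,
    # or if the resulting state's risk breaches the hard ceiling tau.
    delta_e = risk_after - risk_before
    return delta_e > epsilon or risk_after > tau

# A small risk increase into an already-risky region is still flagged:
print(is_ethical_violation(risk_before=0.78, risk_after=0.82))  # True (0.82 > tau)
```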
This criterion provides:
- A time-aware, state-sensitive way to flag problematic actions.
- The ability to detect both sudden jumps in risk and cumulative build-ups, such as repeated minor infractions that eventually escalate into significant violations (illustrated in the sketch after this list).
- A probabilistic framework that works across different application domains.
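To illustrate the jump-versus-build-up distinction, here is an illustrative monitor that walks a sequence of risk estimates E(s(0)), E(s(1)), ... and reports the first step at which either condition trips; the trajectory values are invented for the example.

```python
from typing import List, Optional, Tuple

def first_violation(risk_trajectory: List[float],
                    epsilon: float = 0.05,
                    tau: float = 0.8) -> Optional[Tuple[int, str]]:
    # Return (step index, reason) for the first flagged step, or None.
    for t in range(1, len(risk_trajectory)):
        delta_e = risk_trajectory[t] - risk_trajectory[t - 1]
        if delta_e > epsilon:
            return t, "sudden risk jump"
        if risk_trajectory[t] > tau:
            return t, "risk ceiling breached"
    return None

# Repeated minor infractions: each step adds only 0.04 (below epsilon),
# yet the accumulated risk eventually crosses the ceiling tau = 0.8.
creeping = [0.60, 0.64, 0.68, 0.72, 0.76, 0.80, 0.84]
print(first_violation(creeping))  # (6, 'risk ceiling breached')
```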
How This Differs from Prior Work
- Unlike reward-only formulations, this model integrates a parallel penalty function for ethical instability (a sketch follows this list).
- It avoids fixed rule violations and instead focuses on risk gradients.
- It allows for reasoning about actions that appear good in the short term but are unethical due to long-term systemic effects.
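As a sketch of what running a penalty function in parallel with the reward might look like (the risk stream, the weight `lam`, and the example numbers are assumptions for illustration, not an existing alignment method), the training signal could keep task return and discounted ethical risk as separate, explicitly weighted terms:

```python
def shaped_return(task_rewards, ethical_risks, lam=2.0, gamma=0.99):
    # Discounted task return minus a separately weighted, discounted
    # ethical-risk term; keeping the streams separate makes the
    # performance-versus-risk trade-off explicit and tunable.
    ret, penalty = 0.0, 0.0
    for t, (r, e) in enumerate(zip(task_rewards, ethical_risks)):
        ret += (gamma ** t) * r
        penalty += (gamma ** t) * e
    return ret - lam * penalty

# A lucrative but increasingly risky trajectory can score worse than a
# modest, low-risk one once the risk stream is weighted in:
print(shaped_return([1.0, 1.0, 1.0], [0.0, 0.3, 0.6]))   # ~1.20
print(shaped_return([0.5, 0.5, 0.5], [0.0, 0.0, 0.05]))  # ~1.39
```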
Relevance to LessWrong
LessWrong has historically championed ideas like utility functions, consequentialist reasoning, and robustness to Goodhart’s Law, as seen in foundational posts such as 'The Hidden Complexity of Wishes' and 'Goodhart’s Curse.' This post attempts to ground ethics not as a brittle rule-set but as an evolving, probabilistic function within a system — a framing that integrates well with Bayesian epistemology and rational risk assessment.
Counterarguments and Considerations
- This model requires accurate ethical risk estimation, which is hard.
- It could be computationally expensive in complex systems.
- Human norms are not always risk-minimizing (e.g., whistleblowing may increase risk yet be considered ethical).
The whistleblowing case, however, does not necessarily conflict with the framework. If the model incorporates long-term trajectories, whistleblowing can be seen as a short-term increase in risk that reduces a larger systemic risk down the line (e.g., preventing fraud or institutional decay). The model can therefore accommodate ethically motivated disruptions that ultimately enhance systemic integrity.
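To make this concrete, here is an illustrative comparison of discounted cumulative ethical risk for two hypothetical trajectories; the per-step risk values are invented purely to show how a short-term spike can still yield lower long-run risk.

```python
def discounted_risk(risk_trajectory, gamma=0.95):
    # Discounted sum of per-step ethical risk over a trajectory.
    return sum((gamma ** t) * e for t, e in enumerate(risk_trajectory))

# Stay silent: risk stays moderate, then climbs as the fraud compounds.
stay_silent = [0.2, 0.3, 0.4, 0.6, 0.8, 0.9]
# Blow the whistle: a sharp short-term spike, then systemic risk drops.
whistleblow = [0.7, 0.5, 0.2, 0.1, 0.1, 0.1]

print(discounted_risk(stay_silent))  # ~2.71 (higher long-run risk)
print(discounted_risk(whistleblow))  # ~1.60 (lower, despite the spike)
```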
The remaining challenges, accurate risk estimation and computational cost, are common to any ethical AI formulation and do not undermine the value of developing a systemically grounded definition of ethical behavior.
Conclusion
I believe this risk-based systems model offers a fruitful direction for defining and detecting ethical violations in AI. It captures the temporality, uncertainty, and context-dependence of real-world decisions. While early-stage, it may serve as a foundation for more robust alignment strategies that go beyond static rules and simplistic feedback loops.
I'm especially interested in feedback on whether this model can be made computable in practice, how it compares to other alignment approaches, and whether it inadvertently encodes unstated normative assumptions. I welcome pushback, critique, or suggestions for improvement.
Note: This post was co-written with assistance from an AI language model (ChatGPT), with all ideas reviewed, edited, and finalized by the author.