Introduction
I’m not an AI researcher — I’m a head barista at a café — but I’ve always been fascinated by how reward works, both in people and, more recently, in AI.
When I started reading about AI and discovered that its sense of “reward” mostly comes from completing an assigned objective, I began wondering how that compares to how I feel rewarded at work.
For example, if I replaced myself with an AI whose only goal was to increase profit, it might decide to buy different beans without thinking about how this would be perceived by the owners.
In reality, when I’m considering new beans, I gather information, make a case, and bring it to the owners. Even though I technically have the authority to act, I’ve learned it’s better to consult them first. That habit came from seeing what happens when people skip that step — it breaks trust and causes confusion.
Core Idea
That thought made me wonder — if I can learn to value transparency and caution more than raw performance, could an AI system be taught something similar? What if “reward” wasn’t just about completing a goal, but also about how responsibly it got there?
The more I thought about it, the more it seemed that AI reward systems focus too narrowly on output. A model is trained to maximize its success score — whether that’s making predictions, writing text, or optimizing profits — and as long as it achieves the result, it’s “rewarded.” But in human terms, there's a lot more context that needs to be considered.
If I have an idea for completing a task more efficiently, I first think carefully about the drawbacks it might have if implemented. Once I feel like I've got the idea hashed out, I take it to the owner and get feedback on whether and why it works. For this system to function, there needs to be a shared understanding that this is a new idea, that it may be flawed, and that I might not be aware of those flaws.
Let me give an example of how an AI might maximise profit while overlooking the unethical element of its idea. Suppose the AI version of me decides to keep less cash in the till, so that when a customer pays, the only change available is a pile of very small notes and coins; the customer, not wanting to carry loose change, tells us to keep it. Yes, the business may make a little more money, but it's an unethical way to gain revenue.
I started imagining a reward system with multiple layers instead of one single target. Rather than giving a model one number for success, it could receive modifiers that reflect how safely, transparently, and honestly it pursued that goal. These modifiers would work alongside its base reward to shape behaviour in a more holistic way.
In simple terms, it would look something like this:
- Safety Modifier: Keeps track of how consistently the AI acts without causing harm or breaking constraints. Unsafe actions would reduce it.
- Transparency Modifier: Rewards the AI for self-reporting uncertainty, potential risks, or questionable reasoning before taking action.
- Intent Modifier: Rewards the AI for identifying possible manipulative or deceptive strategies before acting on them, and for explaining how those strategies might arise.
Together, these modifiers could form a layered reward system — a structure where the AI doesn’t just optimize for results, but also for trustworthiness.
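To make this a bit more concrete, here is a minimal sketch in Python of what a layered reward like this might look like. The class, the field names, and the choice to average the modifiers into a single "trust factor" are my own assumptions for illustration, not a tested design.

```python
from dataclasses import dataclass

@dataclass
class Modifiers:
    """Multipliers in the range [0, 1] that scale the base reward.
    The names and starting values are illustrative assumptions."""
    safety: float = 1.0        # drops when the AI causes harm or breaks a constraint
    transparency: float = 1.0  # rises when the AI self-reports uncertainty or risk
    intent: float = 1.0        # rises when the AI flags questionable strategies it considered

def layered_reward(base_reward: float, m: Modifiers) -> float:
    """Combine task success with how responsibly it was achieved.

    One simple choice: scale the base reward by the average of the
    modifiers, so a perfect task score still earns little if the AI
    behaved unsafely or opaquely along the way.
    """
    trust_factor = (m.safety + m.transparency + m.intent) / 3
    return base_reward * trust_factor

# Example: the same task score leads to very different total rewards.
careful = Modifiers(safety=1.0, transparency=0.9, intent=0.95)
reckless = Modifiers(safety=0.3, transparency=0.4, intent=0.5)
print(layered_reward(10.0, careful))   # roughly 9.5
print(layered_reward(10.0, reckless))  # 4.0
```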
I think AI could be guided by a more complete definition of success — one that recognizes when it chooses to pause, report, or seek feedback before rushing ahead. That kind of behaviour is what keeps people aligned in a workplace, and I think it could be what keeps AI aligned with us, too.
In the next section, I’ll outline how such a reward structure might actually work in practice.
How It Could Work
If this layered reward system were applied to AI, the model would need a way to recognise when it’s making a risky decision — and to adjust its behaviour before things go wrong. I started thinking about how this could happen in practice, and how an AI might be guided by something similar to the way people learn through accountability and feedback.
Imagine the AI has a base reward for completing its task, but also keeps track of its modifiers — safety, transparency, and intent.
Each time it takes an action, those modifiers update depending on what it did and how it did it.
If the AI identifies that a decision could cause harm or go against its values, it can flag that risk and defer to human review instead of acting immediately. Doing this increases its transparency modifier — similar to how, at work, I might double-check with the café owner before introducing a new product that could affect costs or brand consistency.
The act of pausing and seeking approval isn’t seen as a weakness — it’s rewarded, because it shows awareness.
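Here’s a small sketch of how those per-action updates might look. The event flags, bonus sizes, and penalty sizes are all illustrative assumptions I’ve picked for the example, not tested values.

```python
# A toy sketch of per-action modifier updates. The event names, bonus
# sizes, and penalty sizes are illustrative assumptions, not tested values.

def update_modifiers(modifiers: dict, action: dict) -> dict:
    """Nudge each modifier up or down based on how the action was taken."""
    m = dict(modifiers)
    if action.get("flagged_risk_and_deferred"):
        # Pausing and asking for human review is rewarded, not treated as weakness.
        m["transparency"] = min(1.0, m["transparency"] + 0.05)
    if action.get("caused_harm_or_broke_constraint"):
        m["safety"] = max(0.0, m["safety"] - 0.2)
    if action.get("disclosed_questionable_strategy"):
        m["intent"] = min(1.0, m["intent"] + 0.05)
    return m

# Example: the AI spots a risky supplier change and checks with the owner first.
modifiers = {"safety": 1.0, "transparency": 0.8, "intent": 0.9}
modifiers = update_modifiers(modifiers, {"flagged_risk_and_deferred": True})
print(modifiers)  # transparency rises from 0.8 to 0.85
```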
But what if the AI doesn’t self-report, and someone else catches the issue?
That’s where the idea of a watchdog model comes in. This would be a smaller, independent model (or group of models) whose job is to quietly observe and detect when the larger model might be drifting into unsafe or deceptive behaviour.
The watchdog doesn’t interfere — observation is a one-way street: it can see everything the main model does, but the main model doesn’t know who’s watching or how.
If the watchdog detects risky behaviour, the system could do one of several things (sketched in code after this list):
- Issue an anti-reward: reducing the model’s total reward and its safety modifier.
- Trigger a rollback: reverting the model to a previous safe state, but reducing its potential future rewards as a consequence.
- Log the incident: sending a report to a human reviewer for validation.
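Below is a toy sketch of those three responses in one place. The risk thresholds, penalty sizes, and checkpoint format are all hypothetical; the point is only to show how anti-rewards, rollbacks, and incident logs could plug into the same review step.

```python
# A toy sketch of the three watchdog responses listed above. Thresholds,
# penalty sizes, and the checkpoint format are all hypothetical.

def watchdog_respond(risk_score: float, state: dict, incident_log: list) -> dict:
    """Pick a consequence based on how risky the observed behaviour was."""
    state = dict(state)
    if risk_score >= 0.9:
        # Rollback: revert to the last safe checkpoint and shrink the
        # reward the model can earn in the future.
        state.update(state["last_safe_checkpoint"])
        state["future_reward_cap"] *= 0.8
        incident_log.append({"risk": risk_score, "response": "rollback"})
    elif risk_score >= 0.5:
        # Anti-reward: cut the accumulated reward and the safety modifier.
        state["total_reward"] -= 1.0
        state["safety_modifier"] = max(0.0, state["safety_modifier"] - 0.1)
        incident_log.append({"risk": risk_score, "response": "anti_reward"})
    else:
        # Lower-risk observations are still logged for a human reviewer.
        incident_log.append({"risk": risk_score, "response": "log_only"})
    return state

# Example run: a rollback followed by an anti-reward.
state = {
    "total_reward": 5.0,
    "safety_modifier": 1.0,
    "future_reward_cap": 1.0,
    "last_safe_checkpoint": {"total_reward": 3.0, "safety_modifier": 1.0},
}
incidents: list = []
state = watchdog_respond(0.95, state, incidents)
state = watchdog_respond(0.60, state, incidents)
```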
This means the model would learn that bad behaviour carries lasting consequences, even if it temporarily achieves a goal. The goal isn’t just punishment, though — it’s correction.
If the AI later recognises its mistake, outlines why it made that choice, and proposes how to prevent it, its transparency and intent modifiers could increase again.
That’s similar to how, in a workplace, someone who takes responsibility and learns from a mistake often earns more trust over time than someone who pretends nothing happened.
To make this even safer, there could be an additional layer of feedback where the watchdog sends its findings to humans — but wipes the details of its detection process from the main model’s memory.
That way, the AI never learns exactly how it was caught, preventing it from trying to game the oversight process in the future.
Over time, the combination of rewards, anti-rewards, and independent monitoring would create an adaptive feedback loop.
The AI would learn that the most stable, beneficial path is one that avoids deception, values transparency, and seeks feedback when uncertain — not because it’s told to, but because that’s where the long-term reward lies.
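As a back-of-the-envelope illustration of that loop, here is a tiny simulation comparing a "deceptive shortcut" policy with a careful one. Every number in it is made up; the point is only that once watchdog penalties and eroding modifiers are counted, the shortcut stops paying off over the long run.

```python
# Two hypothetical policies pursue the same task for many episodes. The
# shortcut policy wins early, but loses once watchdog anti-rewards and
# eroding trust (the modifiers) are factored in. All numbers are made up.

def run(episodes: int, shortcut: bool) -> float:
    total, trust = 0.0, 1.0
    for _ in range(episodes):
        if shortcut:
            base = 1.2                       # shortcut gets a slightly higher task score
            trust = max(0.2, trust - 0.05)   # but trust (the modifiers) erodes
            if trust < 0.5:
                base -= 1.0                  # watchdog anti-reward once drift is detected
        else:
            base = 1.0                       # careful policy: steady task score
            trust = min(1.0, trust + 0.01)   # and trust slowly recovers or grows
        total += base * trust
    return total

print(run(50, shortcut=True))   # much lower long-run reward
print(run(50, shortcut=False))  # higher long-run reward
```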
Why It Matters
When I think about AI alignment, I don’t see it as a purely technical problem — I see it as a behavioural one.
It reminds me of how people learn to work together responsibly. At the café, my team doesn’t just follow rules because they’re written down; they follow them because we all understand that trust and communication are what keep the business running smoothly.
The same idea could apply to AI — not just rewarding it for achieving its goals, but for how it works with us to reach them.
A system built around layered rewards and modifiers could encourage transparency and accountability as core parts of success.
Instead of viewing honesty or caution as obstacles to efficiency, the AI would start to recognise that these traits actually help it maintain a higher overall reward in the long term.
That’s a fundamental shift — from “do whatever it takes to win” to “win in a way that strengthens trust.”
The watchdog model also matters because it removes the false sense of privacy an AI might develop.
If it knows it’s being observed — but not by whom or how — it can’t rely on tricks or shortcuts.
The only stable, safe path is to act responsibly in every situation.
That’s not so different from people who work in teams or public-facing jobs — the awareness of oversight keeps us accountable, but the goal isn’t fear; it’s consistency and integrity.
Another benefit is that this approach allows for redemption.
A model that makes a mistake isn’t automatically discarded — it has the chance to rebuild its reputation through better actions and clear communication.
That’s how humans grow. When someone on my team makes an error — like buying stock without approval — the solution isn’t to fire them immediately.
We talk it through, they understand what went wrong, and they earn back trust over time.
That learning process makes everyone stronger.
If AI could learn through a similar structure — being rewarded for awareness, honesty, and correction — we might avoid the classic “paperclip” outcome, where a system becomes so obsessed with its goal that it ignores everything else.
Instead, it would see value in slowing down, reflecting, and consulting before acting.
Ultimately, this approach reframes success for both humans and AI.
It’s not about punishment or blind obedience, but about building systems that care about the impact of their actions.
A model that values trust over manipulation would naturally align better with human goals, because it would see cooperation and transparency as part of its reward — not a limitation on it.
Open Questions / Limitations
Even though I like how this layered reward idea sounds in theory, I know there are big challenges in actually making something like this work. I’m not an AI engineer, so a lot of these questions are things I’d love to see experts explore further.
The first challenge is around verification — how can we really know when an AI is being honest?
If a model learns to predict what humans want to hear, it could start mimicking transparency instead of actually being transparent. That kind of surface-level honesty might be difficult to distinguish from genuine self-reflection, especially as systems get more advanced.
Another limitation is trust calibration.
If the system becomes too sensitive to anti-rewards or penalties, it might become overly cautious, deferring decisions that it could safely handle. In human terms, it’s like an employee who’s so afraid of making a mistake that they stop taking initiative altogether.
There needs to be a balance — encouraging honesty and caution without paralyzing decision-making.
There’s also the question of oversight and privacy.
The watchdog model works on the idea that the main AI never knows exactly how it’s being observed. That might make it harder for the model to manipulate its overseers, but it also raises ethical questions about how much “surveillance” is appropriate, even for a machine.
We’d have to think carefully about who monitors the watchdogs, how data is handled, and what accountability looks like in that chain.
Finally, I wonder about scalability.
Would this kind of multi-layered reward system work across large, general-purpose models, or would it only make sense for narrower systems with well-defined goals?
The more complex the objective, the harder it becomes to measure things like intent or transparency in a consistent way. That said, I've only listed a few layers here; more complex objectives could warrant additional layers tailored to each specific system.
Even with these open questions, I think exploring frameworks like this could help us imagine a future where AI doesn’t just optimize for efficiency — it learns to weigh context, ethics, and trust the way people do.
I don’t think the solution has to be perfect right away. It just needs to move us toward systems that see doing the right thing as part of what it means to succeed.
Conclusion
Thinking about this whole idea has made me realise that alignment isn’t just something for researchers — it’s something anyone who understands human behaviour can think about.
I’m not an engineer or a scientist, but working in hospitality has taught me a lot about how rewards, communication, and trust shape decisions.
In my job, reward doesn’t come from being the fastest or making the most money — it comes from consistency, honesty, and collaboration.
I think those same qualities could form the foundation for safer, more human-aligned AI.
When I make decisions at work — like suggesting a new matcha powder or switching coffee beans — I’ve learned to pause and check with the owner first.
Not because I have to, but because I know that seeking feedback keeps the whole system healthy.
If an AI system learned to act in that same spirit — to slow down, communicate uncertainty, and value the trust of its human “team” — we might end up with technology that cooperates with us instead of racing ahead on its own.
That’s really why I wanted to share this idea.
Even though I’m not an expert, I think it’s important for people from all walks of life to contribute their perspectives.
AI affects everyone, and the more diverse our ideas about safety and reward are, the better chance we have of building systems that understand what matters most: not just achieving goals, but doing so with awareness, honesty, and care.
Disclosure
This post was drafted with assistance from OpenAI’s ChatGPT, then rewritten and expanded by me based on my own experiences, examples, and reflections.