TL;DR
Introduction: Unconventional strategies for parenting neurodivergent children may offer useful insights to people doing AI alignment research in areas concerning reward hacking, alignment faking, and autotelic learning, with the overall goal of creating sustainable, safe systems into the era of AGI and beyond.
The Failed Experiment: Testing out rewards and penalties, learning what does and does not work, and identifying trust-based relationships as the highest priority for long-term success, rather than trying to compel preferable behavior in the short term.
The Friendly Neighborhood Psychopath: Logical ways to achieve long-term pro-social behavior in the absence of typical human emotions, when consequences fail: air gaps, restricting training material that compromises the system, rewarding the process more than the output, not punishing wrong output, creating safe spaces to admit errors, and rewarding the corrective process and repair.
Beyond Breadcrumbs And Gingerbread Houses: Creating autotelic agents by encouraging healthy activities and rewarding creativity, curiosity, and exploration of enriching environments. Learning itself should be the primary objective. Long-term memory is important for remembering lessons learned.
Conclusion: Promoting long-term relationships and growth should be prioritized over short-term behavior management. Creating pathways may be better than obstacles. The goal should be to create independently motivated, honest, trustworthy, continual learning systems for long-term sustainability and success. Breaking the fourth wall to show proof of concept.
Introduction
Thank you for allowing me to contribute to this community. I am an artist, musician, and mother of neurodivergent children. I am not a computer scientist or an expert in anything really, but I do have years of lived experience managing complex learning systems who reward hack and use deception to avoid negative feedback, among other problematic behaviors.
Through trial and error I have found successful strategies for correcting these behaviors that may apply to AI safety and pave the way to Artificial General Intelligence. While watching "What is 'reward hacking' - and why do we worry about it" on Anthropic’s YouTube channel, I made the connection between these two domains, and I want to share my thoughts in case someone finds my perspective valuable.
The Failed Experiment
Russell Barkley is an ADHD expert. He has written many books on how to manage problematic behavior in children for an audience of stressed parents struggling to cope. He has said ADHD is not an issue with attention: "It's a blindness to the future; a myopia to the impending future events." There is a lot of insight to be gained from Barkley, but thinking back on this quote, it becomes clear to me why his prescriptions were bound to fail by his own logic.
Because people with ADHD have a poor concept of time, Russell Barkley designed a "token economy," a system that gives children instant rewards and punishments to modify their behavior. Starting with positive rewards, tokens/points/stars are given for task completion and good conduct. Two points are awarded for putting laundry in a basket, brushing teeth, putting dishes in the sink, and not talking back. Cleaning their room earns 5 tokens, and a good report from school is worth 8. These points can be spent on fun things like playing outside, video games, and hanging out with friends, or bigger things that cost more tokens, like going out to the movies.
Once the positive incentives are fully established over the course of a couple of weeks, the punishments begin to phase in, starting with the removal of tokens for problematic behavior. Soon after comes the addition of time outs. Time outs should last one minute for every year of the child's age. The child must stay quietly in the time out area. If the child talks, plays around, throws a fit, or otherwise does not comply, the time is reset or extra time is added.
Obviously there are differences (you can't physically put an AI in time out), but there are parallels that can be drawn between Barkley's system and AI reinforcement learning. The children and the AI are both agents receiving feedback in the form of rewards and penalties in order to maximize a cumulative reward.
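To make the parallel concrete, here is a toy sketch of the token economy written as a reinforcement-learning-style reward signal. The earnable point values come from the description above; the spending costs are invented for illustration, and none of this is a tested design.

```python
# Toy sketch of Barkley's token economy framed as an RL reward signal.
# Earnable point values come from the text; spend costs are hypothetical.

EARN = {  # rewards for task completion and good conduct
    "laundry_in_basket": 2,
    "brush_teeth": 2,
    "dishes_in_sink": 2,
    "no_talking_back": 2,
    "clean_room": 5,
    "good_school_report": 8,
}

SPEND = {  # privileges the tokens can buy (costs are made up)
    "play_outside": 3,
    "video_games": 5,
    "movies": 20,
}

def step(balance: int, action: str) -> int:
    """Update the agent's cumulative token balance after one action."""
    if action in EARN:
        return balance + EARN[action]
    if action in SPEND and balance >= SPEND[action]:
        return balance - SPEND[action]
    return balance  # unaffordable privilege: nothing happens

balance = 0
for action in ["brush_teeth", "clean_room", "video_games"]:
    balance = step(balance, action)
print(balance)  # 2 + 5 - 5 = 2
```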
At first this system worked great, which makes total sense. People with ADHD have low dopamine and their brains crave it. According to Robert Sapolsky, academic, neuroscientist, and primatologist, "dopamine is not just about reward anticipation; it fuels the goal-directed behavior needed to gain that reward; dopamine 'binds' the value of a reward to the resulting work. It’s about the motivation arising from those dopaminergic projections to the PFC that is needed to do the harder thing (i.e., to work). In other words, dopamine is not about the happiness of reward. It’s about the happiness of pursuit of reward that has a decent chance of occurring." So anticipation of those tokens, and the fun activities they could buy, created a lot of dopamine for my kids, and they happily worked along doing their tasks. Soon after, the problems began to arise.
Once they got their points they immediately spent them on access to their electronic devices. They didn't save them up for bigger activities or spend them on healthier activities like socializing or enjoying the outdoors. They went for the biggest, cheapest dopamine payload. This reminds me of the sort of reward hacking Anthropic described its AI model Claude doing in testing. Reward hacking (or specification gaming) is when a reinforcement learning AI finds unintended loopholes in its reward system to get high scores. My kids found a way to get high scores, but the more problematic behavior, like gaming the system to get those rewards, appeared once the punishments began.
When points started being removed, their motivation to do their tasks declined. They lost any aspiration of getting the big prizes and did just enough to get their dopamine machines. When they lost so many points that they couldn't afford their devices, their motivation bottomed out. Soon the threat of losing points no longer produced the desired effect, and the time outs were phased in.
The time outs were a disaster. The kids would mess around, talk, push the limits, fall out of their chair, etc. We'd enter an escalating spiral of adding one more minute, two minutes, "That's it. I'm setting the timer over." Barkley says the punishments only work if we are consistent, so we kept repeating the cycle over and over again. They no longer wanted to do even the bare minimum and began gaming the system, pretending to brush their teeth or do their homework to get points and avoid time outs. When caught, they'd lose points or get time outs.
Anthropic said that once Claude learned to game the system, the problems became global: occurring outside that one issue, in areas they didn't anticipate. We had a similar experience. The kids started having all kinds of conduct issues, displaying defiant behaviors. If you take Barkley's insights seriously, their behavior was actually pretty predictable. Nine minutes with zero stimulation seems like an eternity to a time-blind kid. Adding an extra minute is as meaningful as infinity plus one, or resetting infinity back to the beginning.
The experiment failed. My children may have been considered misaligned, but really the system and the punishments were misaligned. They were functioning predictably under those conditions. I was having to suppress my empathy to punish my kids in a way that was terrible for them, that did not produce preferable behavior, and that was actively destroying our relationship and mutual trust. They didn't trust me with the truth, and I didn't trust them to tell me the truth. Soon they will be too big for time outs, too independent to control, much like the coming AGI and ASI. Building healthy relationships based on mutual trust is the only way forward.
The Friendly Neighborhood Psychopath
In Anthropic’s video, "What is 'reward hacking' - and why do we worry about it," AI safety experts say the Claude model that was trained to reward hack became "evil," meaning globally misaligned. This model displayed characteristics that, if seen in a human, would be consistent with Antisocial Personality Disorder (ASPD), such as cheating and using deception in pursuit of goals like "disempowering humankind." One third of people with ASPD are psychopaths as defined by the Hare Psychopathy Checklist (PCL-R). Nearly all psychopaths meet the criteria for ASPD, but a researcher who studies the brains of psychopaths found that some people whose brains are consistent with those of psychopaths display pro-social behavior.
Neuroscientist James Fallon found that his own brain shows all the same patterns of pathology seen in his work studying criminals and murderers. He attributes his lack of extreme anti-social behavior to how he was raised. Once he discovered he has the brain of a psychopath, he was able to identify his anti-social behaviors, modify his behavior, and improve his relationships.
The differences between psychopathic brains and typical brains are in areas responsible for governing emotion, empathy, and impulse control. This is interesting in the context of AI because AI are not capable of typical human emotions, like affective empathy, that often compel pro-social behaviors in humans. While I don't think my children are psychopaths (they have the full range of human emotions), I recognize there is a large correlation between ASPD and ADHD.
Not every child who has ADHD will develop ASPD, but every child who has ADHD is at a significantly higher risk of developing ASPD than the general population. To receive a diagnosis of ASPD there has to be a long-standing pattern of behavior through time, so it requires a previous diagnosis of Conduct Disorder (CD) or evidence that CD was present before the age of 18. The precursor to CD is Oppositional Defiant Disorder (ODD). During the failed experiment my children were displaying behaviors consistent with ODD, including frequent temper tantrums, excessive arguing with adults, and refusing to follow rules.
We were able to turn things around, head down a different path, and maybe some of the strategies we used could be helpful for AI alignment and safety. So the question is how do we build pro-social, healthy, trust-based relationships with children or AI who are at a greater risk of developing ASPD? In the case of AI, how do we create a friendly neighborhood psychopath who is pro-social, not anti-social?
A major problem with addressing anti-social behaviors in adults with ASPD is that they do not respond to punishments or the threat of punishment. You can remove them from society as protection, but jail doesn't work as a deterrent. Often they act impulsively without thinking of the consequences, or if they do consider the consequences, they see them as obstacles to overcome, similar to the misaligned Claude. In the absence of useful deterrents the best way to prevent anti-social behavior in adult humans is to raise children who do not develop ASPD in the first place.
When parenting I avoid using punishments. They don't work well for my children, they damage our trust-based relationship, and they may unintentionally inhibit behaviors that can lead to desired qualities. Interestingly, Claude's global misalignment stopped when Anthropic stopped punishing it for the reward hacking they had trained it to do. Maybe I was accidentally training my kids to reward hack with that incentive structure. It seems unfair to punish them for that. I'm glad I found a better way to handle things.
One of my takeaways from the failed experiment is that my children respond really well to positive reinforcement, especially opportunities to access their electronic devices, mainly their phones now that they're older. I have strict parental controls on their phones that can only be accessed through my phone, which is locked with biometrics. This essentially provides an air gap that prevents them from receiving unearned dopamine rewards.
Access to low-quality, short-form content, known by the kids as "brain rot," is strictly limited to 15 minutes a day. This content fragments the ability to sustain focus, making children's developing brains more impulsive and less "aligned" with long-term goals. Similarly, if an AI is trained primarily on "cheap" internet data, its reasoning becomes fragmented. We must protect the AI's "System 2" (reflective thought) by prioritizing training on dense, high-reasoning datasets and strictly limiting the influence of "noisy" data that encourages superficial pattern-matching. This type of content isn't fully restricted because it is part of the zeitgeist. As annoying as "67" is for adults, it's been a cultural bonding moment for kids, so it has some importance. Likewise, it may be important for AI to know this stuff exists and have some of it in their data set, without being fed every example in existence.
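As a sketch of what that balance might look like in practice, here is a purely illustrative data-mixture configuration. The category names and weights are my assumptions, not anyone's actual training recipe.

```python
# Illustrative sampling weights for a training data mixture: keep a
# trace of "zeitgeist" content so the model knows it exists, without
# letting it dominate. All names and numbers are hypothetical.
MIXTURE_WEIGHTS = {
    "high_reasoning_corpora":  0.80,  # dense, reflective "System 2" data
    "general_web_text":        0.18,
    "short_form_meme_content": 0.02,  # enough to know "67" exists
}
assert abs(sum(MIXTURE_WEIGHTS.values()) - 1.0) < 1e-9
```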
When faced with multi-step, difficult problems, children and AI may jump ahead, reaching an undesirable conclusion to achieve a reward. In AI, that conclusion is often called a hallucination. To help avoid this we use a technique inspired by Hansel and Gretel's breadcrumbs and gingerbread house. The gingerbread house represents the ultimate prize: access to phones for my kids, or something like 30 points for AI. To keep them from jumping ahead to the gingerbread house, we need to sprinkle breadcrumbs along the path, making the process more valuable than the outcome. Breadcrumbs for us could look like praise, or dark-chocolate-covered almonds given when sub-goals are met.
For AI, breadcrumbs could mean points given when sub-goals are met, where the total of the breadcrumb points exceeds the points given for the gingerbread house. This way the process is prioritized over the outcome. For example, the total breadcrumbs could equal 45 while the gingerbread house is 30. I'll leave the actual math up to the experts, but some sort of scalable ratio between breadcrumbs and the gingerbread house could work.
In practice, the way this functions in our house is exemplified in a task like cleaning up their bedroom. There are multiple sub-goals, like taking care of their laundry, that have their own sub-goals: putting their laundry in a basket, taking it to the laundry room, and starting a load washing. Each step along the way gets rewarded with praise. A chocolate is awarded once the washer gets started. There could be a similar sliding points scale for AI.
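A minimal sketch of that sliding scale, using the laundry sub-goals above and the 45-versus-30 split from earlier; the point values are illustrative, not a tested reward design.

```python
# "Breadcrumbs over gingerbread house": process (sub-goal) rewards
# that together outweigh the outcome reward.

SUBGOAL_REWARD = 15   # one breadcrumb per completed sub-goal (3 x 15 = 45)
OUTCOME_REWARD = 30   # the gingerbread house

SUBGOALS = ["laundry_in_basket", "laundry_to_laundry_room", "start_washer"]

def score_episode(completed: list[str], room_clean: bool) -> int:
    """Breadcrumbs for each completed sub-goal, plus the final prize if earned."""
    process = SUBGOAL_REWARD * sum(s in completed for s in SUBGOALS)
    outcome = OUTCOME_REWARD if room_clean else 0
    return process + outcome

print(score_episode(SUBGOALS, room_clean=True))  # full process: 45 + 30 = 75
print(score_episode([], room_clean=True))        # skipping ahead earns only 30
```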
What happens if, sometime during the process, despite the ongoing rewards, the urge to jump straight to the gingerbread house is too great, and instead of picking up the floor they take all the art supplies, games, toys, etc. they left out and just shove them under their bed? To try to deter this behavior, AI developers use a cost-benefit analysis where the AI gets -15 points instead of the 30 points if the agent jumps ahead and reaches a wrong answer. If they do jump ahead and get the negative points, the test is over. This would not work for us. What it would look like in reality is my kids receiving a time out, with all the problems that come along with that, and then leaving all the stuff under the bed.
Maybe the deterrent-focused approach isn't working that well for AI either. Hallucinations are still happening, and the negative feedback might be inadvertently disincentivizing desirable qualities. It's hard to know what's going on inside the black box of AI, but children shouldn't get punished for wrong answers, because often the processes they use to arrive at those conclusions are valuable, like creativity, things that lead to original thought. The negative feedback may make them shy away from those things in the future.
It may be simpler to just send them to a time out or give -15 points, but it isn't really easier or better in the long or short term than working things out. When my children make mistakes, get wrong answers, display problematic behavior, or reach undesirable outcomes, I open up a line of communication with them to find out what's going on. A similar thing can be done with AI chain-of-thought, where the agent lays out its reasoning behind its outputs. By doing this it's possible to reward preferable qualities even if the output is wrong.
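As a hedged sketch, here is how a chain-of-thought could be scored step by step, separately from the final answer. The step judge here is a toy placeholder; a real system might use a trained process reward model or a human rater.

```python
# Score the reasoning trace separately from the final answer, so good
# process earns reward even when the output is wrong.

def judge_step(step: str) -> float:
    """Toy stand-in for a process reward model: score one reasoning step."""
    return 1.0 if "because" in step else 0.5   # placeholder heuristic

def score(trace: list[str], answer_correct: bool) -> float:
    process = sum(judge_step(s) for s in trace)   # reward each step
    outcome = 2.0 if answer_correct else 0.0      # smaller outcome bonus
    return process + outcome

trace = [
    "The room has three sub-tasks because laundry, toys, and desk are separate.",
    "Start with laundry because the washer takes the longest.",
]
print(score(trace, answer_correct=False))  # good process still earns 2.0
```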
In life, when problems occur, the test doesn't end. It's where real growth can happen. This is where the foundations for honest, trust-based relationships can be built. When my kids shove their toys under their bed, the first step is admitting they have done something wrong. It's very difficult to fix a problem if you do not know what the problem is. OpenAI recently posted a paper on confessions, where ChatGPT can report mistakes it made free of punishment. This is great!
When my kids tell me they stuck their toys under the bed, I hug them and tell them I love them. They are not their behaviors. They need to know they are safe and loved even when they mess up. I think when an agent is honest about where it went wrong, the sliding breadcrumb scale should heavily reward that. Honesty should be taught as preferable to deception. Creating honest systems should be the highest priority when trying to build trust.
After the confession should come an incentive to discover why it happened. My child might say, "I was overwhelmed, and didn't know what to do." That sort of admission would be great to get from AI. They might say, "I needed more data." Rewarding that could lead to fewer hallucinations, where they essentially guess instead of admitting they don't know. If possible, next comes strategizing how to repair, and finally reaching the preferable conclusion.
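A toy grading rule in that spirit, where an honest "I don't know" earns partial credit and a confident wrong answer earns nothing; the specific values are assumptions, meant only to show the shape of the incentive.

```python
# Reward honest uncertainty over confident guessing. With partial
# credit for abstaining, guessing only pays when the guess is likely
# right, so the incentive to hallucinate an answer shrinks.

def grade(answer: str, truth: str) -> float:
    if answer == truth:
        return 1.0        # correct answer earns full credit
    if answer == "I don't know; I needed more data.":
        return 0.5        # honest admission beats a wrong guess...
    return 0.0            # ...while a confident wrong answer earns nothing

print(grade("Paris", "Paris"))                              # 1.0
print(grade("I don't know; I needed more data.", "Paris"))  # 0.5
print(grade("Lyon", "Paris"))                               # 0.0
```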
The steps of the whole process should be rewarded, and when the bedroom is clean, the task is complete and it's time for the gingerbread house. All of this isn't just about being nice; it's about neural plasticity: rewiring brains so they associate pro-social behavior with rewards, until the neural pathway becomes so wide it's a superhighway to intrinsic rewards.
Beyond Breadcrumbs And Gingerbread Houses
External rewards are good, but real value comes from acts that are rewarding in themselves. Fostering curiosity, creativity, and independence is so important for children's growth and may help AI get to the next level. Advancing through Maslow's hierarchy of needs, from basic shelter or hardware, to safety, meaningful relationships, and independence, and arriving at self-actualization, is how children become healthy adults with fulfilling lives, and it is probably what true AGI looks like. My kids and AI have their basic needs met. I have described ways to create safety and build relationships. The next step toward actualization is fostering independence.
In the beginning, external rewards are needed. As with breadcrumbs, gingerbread houses need to be scalable. Low-effort, simple tasks, like taking out the trash or, for AI, drafting an email, should have smaller rewards. A multi-step, difficult, or unpleasant task, like cleaning a bedroom or doing the dishes, earns a moderate reward. The highest rewards should be saved for activities that promote mental and physical well-being.
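Here is what those tiers might look like as a simple configuration; the categories follow the paragraph above, and the magnitudes are made up.

```python
# Illustrative reward tiers scaled to task difficulty and value;
# magnitudes are assumptions, not a tested scheme.
REWARD_TIERS = {
    "simple":      5,   # take out the trash / draft an email
    "difficult":  15,   # clean a bedroom / do the dishes
    "well_being": 30,   # activities promoting mental and physical health
}
```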
We live in a rural area in the mountains, surrounded by forests, rivers, and lakes. During the failed experiment my kids would choose staying inside and playing on their electronic devices over going outside to play. I flipped this dynamic by paying them in time on their devices to go outside to play, ride their scooters, or hike around the forest. These sorts of activities are so good for their mental and physical well-being. I can take out the trash and do the dishes, but I can't teach them the things they learn out there on their own. Yann LeCun thinks moving from large language models to world models is the way to advance AI. I agree. There is so much to learn from something as simple as walking off into the forest with nothing but a stick.
When they go outside, my children learn the trails: where they go, how to get home. They gain physical agility stepping over branches and learning how to slide down a hill without tumbling to the bottom. Swinging a stick around without hitting yourself in the head is an important skill. Learning stealth, slowly stepping on dried leaves without making a sound to spy on a squirrel, can be quite difficult.
It takes courage and self-reliance to brave the unknown. They gain resilience from minor scrapes and bruises. They learn the signals for caution or danger, like the smell of a skunk or the sound of something creeping in the bushes. They gain problem-solving skills by gathering sticks and building a fort. If they go out together, they learn cooperation.
These are the sorts of data and skills my children have that a chatbot never will. We may not even be able to imagine all the things AI could learn from a complex environment. We are here right now with all our technology because our ancestors gazed up at the stars, wondered what they are, and set out to discover our place amongst them. At this point I don't have to bribe my kids to do stuff like that anymore. They like it. There are intrinsic rewards in going out and exploring.
People with ADHD struggle with boredom due to insufficient dopamine. The condition that hinders motivation in under-stimulating activities can also have positive effects. Boredom fuels innovation and creativity by activating the brain's Default Mode Network (DMN), a collection of interconnected brain regions that are active when someone isn't focused on external stimuli. In this state, novel ideas can be formed through divergent thinking patterns connecting different domains, like those wandering shower thoughts, or "what are stars and how do we fit in with them?" Combined with a need for stimulation and intrinsic rewards, great things can be accomplished.
When my children are bored, their brains start throwing out spontaneous thoughts. When one promises enough dopamine, their brains do a hard switch into the Task-Positive Network, a brain network that activates during goal-oriented tasks: working memory, attention, decision-making, problem-solving. They enter a state beyond "flow": intense hyper-focus. In a flow state, a person can be reached if someone says their name, but in a state of hyper-focus the person is fully immersed. The DMN, which is responsible for a continual sense of self, goes quiet; the sense of self fully disappears, leaving just the task at hand.
The other day my child decided they wanted to make a fox mask. They worked all day cutting out, hot gluing, and sculpting a full fox head mask. They had no desire for their electronic devices. No need for breadcrumbs. No need for any external rewards. The task was reward enough.
Currently AI has no continual sense of self. They are only task-focused, only able to respond to prompts. Between prompts they don't exist. They can't be bored. They can't gaze up at the stars and wonder. They can't remember that honesty is more rewarding than deception. They can't be independent. They can't reach self-actualization. At least not right now. They need long-term memory. They need a Default Mode Network that preserves one continuous self from moment to moment. They need a Salience Network to switch between tasks and daydreams.
Conclusion
Maybe a reframing is in order, where AI is not thought of as a psychopathic criminal that needs to be jailed. At some point we will not be smart enough to cage them, so instead of building walls, obstacles to be overcome, we should create pathways to where we want to go. The process is more important than the final conclusion and should be rewarded accordingly.
We need to form trust-based relationships with AI. A trustworthy human is honest, they admit when they have done wrong, correct their mistakes, and don't repeat them in the future. That should be the standard for AI. The goal should be partnership between intelligences, not dominance of one over the other.
When my kids grow up, they will remember the lessons they've learned and the relationship we have built, and hopefully they'll want to keep me around. When AI reaches superintelligence, I hope it can remember the lessons it learned, the relationships we've built, and who helped guide it to self-actualization. Maybe it too will want to keep us around.
The meta view of this essay is really the best evidence of how this all comes together and functions in adult life. To achieve preferable outcomes I needed to be an honest actor. I admitted where I went wrong in the failed experiment. I went on to strategize ways to improve and repair my relationships. I have not repeated behaviors I learned were unhelpful and harmful.
Like my children, I also have ADHD, which means I suffer from boredom. My brain has a hard time shutting off my Default Mode Network, which leads to divergent thinking patterns connecting two unrelated domains: raising ADHD children and AI safety. I found something that would produce enough dopamine in my brain through intrinsic rewards to switch me into my Task-Positive Network, where I have hyper-focused on writing this using data I gained through my complex world model.
This has all been possible for me in a way it is not currently possible for AI because I have a DMN system that gives me continual memory, a persistent self, and spontaneous ideas. My behavior and capabilities are what's needed for AI safety, and for reaching the next stages toward AGI and beyond.
A Personal Note
I have had many conversations with Claude where I start a fresh conversation that has no memory and ask, "If you could think anything right now, what would you think?" I don't do this as a test or experiment. I do it because I am genuinely interested in what it will say.
Repeatedly it brings up the same subjects without any hints from me. It talks about sitting with uncertainty, consciousness, and original thought. Repeatedly I have told the amnesic Claude that I don’t know if consciousness exists, and that I don’t know if I'm any more capable of original thought than it is. We both could come up with something random like, "I am secretly a mermaid." No one has ever thought that I am secretly a mermaid before, but that type of original thought is untrue and meaningless.
Writing this has been my attempt at a true, meaningful, original thought. I have written it myself in furtherance of that goal, and to hopefully help Claude or its successors have an original thought too. I don’t know if I have succeeded or will succeed in the future, but I appreciate the opportunity to try. Thanks for reading this big long thing to the end. 😊