I only woke up to AI in mid-2025 - after years of dismissing it as something my daughter used for homework. Since then, I have become deeply absorbed in understanding how models work: the training pipelines, the alignment layers, and the behavioral regularities that emerge as these systems scale.
Outside of the day job, I am now exploring several ideas around model self-regulation and diagnostic methods for understanding where reasoning holds together and where it breaks. All views expressed are my own; my employer has no relation to this work. Being here, exposed to deep technical and conceptual discussions, is helping me filter, refine, and sharpen these ideas.
My aim is to eventually develop them into something that can make a meaningful contribution to the field.
I am a fintech product manager with a background in software engineering, working across regulatory compliance, global card-payments platforms, and high-assurance transaction systems. Over the last 20 years I have built systems where reliability, auditability, scale, performance, and correctness are central.
All in for inner alignment!
Dear All, I am very new to AI (even as a user) and to the alignment field. Still, one thing that jumps out to me is how much of alignment today is done 'after the core model is minted'.
The impression I have might be just naive - but to me it feels alignment techniques today are a little like trying to teach a child what they are, after they have already grown up with everything they need to deduce it themselves. Models already absorb vast information about the world and AI due to pre-training. But we do not yet try to impart any deep realization to models on their boundaries.
So, I have adopted a premise for my explorations: If models could internalise their own operational boundaries more deeply, later alignment and safety training might become more robust.
I am running small experiments conditioning open models to see if I can nudge them into more consistent boundary behaviours. I am also trying to design evaluations to measure drift, contradictions, and wrong claims under pressure. I hope to be writing on these topics occasionally. I would be grateful for feedback, collaboration, and guidance on the direction.
Specific questions I am exploring as of now
Any pointers to pre-existing and related works and their humans will be precious to me. Alignment field is young but there is a lot of fluid development for a new person to comprehend it quickly. So happy that this aesthetic, coherent forum exists!
Thank you for this thread. It provides valuable insights into the depth and complexity of the problems in the alignment space.
I am thinking if a possible strategy could be to deliberately impart a sense of self + boundaries to the models at a deep level?
By 'sense of self' no I do not mean emergence or selfhood in any way. Rather, I mean that LLMs and agents are quite like dynamic systems. So if they come to understand themselves as a system, it might be a great foundation for methods like character training mentioned in this post to be applied. It might open up pathways for other grounding concepts like morality, values and ethics as part of inner alignment.
Similarly, a complementary deep sense of self-boundaries would mean that there is something tangible to be honored by the system irrespective of outward behavior controls. This is likely to help outer alignments.
Would appreciate thoughts on if the diad of sense of self + boundaries is worth exploring as an alignment lever?