Hi, people of the LessWrong community. I'd like to share what I've been working on. I found that unstable AI outputs don't come from high-variance regions of embedding space, but from geometrically constrained "narrow passages." The tool is built in Python and uses geometric features to flag high-risk model states.
In my tests, it detects poisoned training data with 94.7% AUC and identifies a homogenized "geometric condensate" state (G-ratio → 0.95). I'd welcome anyone testing this for themselves and letting me know whether they can reproduce my findings. Really, I'm just looking for peer review so I know I'm not on a wild goose chase.
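To give a quick sense of what a homogenization metric in this spirit could look like before you open the repo, here is a minimal toy sketch. The actual G-ratio definition is in the repository; the `g_ratio` function below, which measures how much embedding variance collapses onto the top principal direction, is only my illustration, not the repo's exact formula:

```python
# Toy sketch of a "homogenization" score over a batch of embeddings.
# NOTE: this g_ratio is a hypothetical illustration; the real G-ratio
# is defined in the linked repository.
import numpy as np

def g_ratio(embeddings: np.ndarray) -> float:
    """Fraction of total variance along the top principal direction.

    embeddings: (n_samples, dim) array of model embedding vectors.
    A value near 1.0 means the states have collapsed onto a single
    direction, a homogenized, "condensate"-like geometry.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to variance per axis.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[0] / var.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    isotropic = rng.normal(size=(512, 64))
    # Collapse the cloud toward one direction to mimic homogenization.
    direction = rng.normal(size=64)
    condensed = 0.1 * isotropic + rng.normal(size=(512, 1)) * direction
    print(f"isotropic G-ratio: {g_ratio(isotropic):.2f}")  # small, ~1/dim
    print(f"condensed G-ratio: {g_ratio(condensed):.2f}")  # near 1.0
```

In this toy version, an isotropic cloud scores close to 1/dim, while a cloud collapsed onto one direction scores near 1.0, which is the kind of regime the "condensate" label points at.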
My goal when starting this was to stop harmful outputs before they are sent. I wasn't planning to do all this, but one thing led to another and the work snowballed.
https://github.com/DillanJC/geometric_safety_features