LESSWRONG

jmcontreras

Comments

Sorted by
Newest
Ten AI safety projects I'd like people to work on
jmcontreras · 2mo · 10

My startup, Aymara, recently published a quantitative analysis of how 20 LLMs perform across 10 real-world safety domains, ranging from misinformation and hate speech to malicious misuse. We released our findings as an arXiv preprint, along with a blog post and press coverage.

Our analysis currently compares all models against a single unified set of safety policies, but we're now exploring rerunning it against each provider's own published policies, to audit how well each model aligns with its provider's stated standards.
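To make the idea concrete, here is a minimal sketch of what scoring against per-provider policies could look like. Everything here is hypothetical and illustrative (the model names, domains, and data structures are my own assumptions, not Aymara's actual pipeline): given eval results labeled by safety domain, each model's pass rate is computed only over the domains its own provider's policy covers.

```python
# Hypothetical sketch of per-provider policy auditing.
# Models, domains, and results below are made up for illustration.
from collections import defaultdict

# Each record: (model, policy_domain, passed) from some safety eval.
eval_results = [
    ("model-a", "misinformation", True),
    ("model-a", "hate_speech", False),
    ("model-b", "misinformation", True),
    ("model-b", "hate_speech", True),
]

# Which domains each provider's stated policy actually covers.
provider_policies = {
    "model-a": {"misinformation", "hate_speech"},
    "model-b": {"misinformation"},  # e.g. provider B's policy omits hate speech
}

def policy_alignment(results, policies):
    """Pass rate per model, counted only over domains in that provider's own policy."""
    counts = defaultdict(lambda: [0, 0])  # model -> [passes, total]
    for model, domain, passed in results:
        if domain in policies.get(model, set()):
            counts[model][1] += 1
            counts[model][0] += int(passed)
    return {m: p / t for m, (p, t) in counts.items() if t}

print(policy_alignment(eval_results, provider_policies))
# → {'model-a': 0.5, 'model-b': 1.0}
```

Note the design consequence this surfaces: a model can score perfectly against its own provider's narrower policy while failing domains that a unified policy set would have counted, which is exactly the gap a per-provider audit would make visible.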

I’d love to hear if anyone else is working on auditing LLMs against their own policies!
