'I get surrounded by small ugh fields that grow into larger, overlapping ugh fields until my navigation becomes constrained and eventually impossible' was how I described one such experience.
A Sketched Proposal for Interrogatory, Low-Friction Model Cards
My auditor brain has been getting annoyed at the current state of model cards. If we adopt better norms proactively, this seems like low effort for a moderately good payoff? I am unsure about this, hence the rough draft below.
Problem
Model cards are uneven: selective disclosure, vague provenance, flattering metrics, generic limitations. Regulation (EU AI Act) and risk frameworks (NIST AI RMF) are pushing toward evidence-backed docs, but most “cards” are still self-reported. If we want to close gaps in evals and make safety claims more falsifiable, model card norms are an obvious lever.
Design Goal
A modern, interrogatory model card that:
- Minimizes authoring friction
- Maximizes falsifiability/comparability
- Fits EU/NIST governance
- Works for open/closed models
Lever: machine-readable schema + sharp prompts + links to evidence (metrics/datasets/scripts), not essays. Fine-tune the requirements in later versions.
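For concreteness, here is a minimal sketch of what the schema-first core could look like. Every field name here (model_id, out_of_scope_uses, performance_claims, and so on) is a placeholder invented for illustration; the real field set would be pinned down by the v0.2 JSON Schema.

```python
# Hypothetical MUST-level fields; names are placeholders, not a fixed spec.
REQUIRED_FIELDS = {
    "model_id",             # identity & lineage
    "lineage",
    "intended_uses",
    "out_of_scope_uses",    # >=3 concrete unsafe uses + rationale
    "performance_claims",   # each claim links dataset version, eval commit, run hash
    "limitations",
    "data_provenance",
    "safety_testing",
    "operational_constraints",
    "changelog",
}

example_card = {
    "model_id": "acme/example-7b",
    "lineage": {"base_model": "acme/example-7b-base", "finetune_of": None},
    "intended_uses": ["summarization of internal documents"],
    "out_of_scope_uses": [
        {"use": "medical diagnosis", "rationale": "no clinical validation"},
        {"use": "automated hiring decisions", "rationale": "subgroup metrics incomplete"},
        {"use": "autonomous code deployment", "rationale": "untested agentic failure modes"},
    ],
    "performance_claims": [
        {
            "metric": "exact_match",
            "value": 0.71,
            "dataset": "example-qa@v2.1",
            "eval_script_commit": "abc1234",
            "run_hash": "sha256:placeholder",
        }
    ],
    # Remaining MUST fields deliberately omitted to show the check below firing.
}

def missing_must_fields(card: dict) -> set[str]:
    """Return which MUST-level fields are absent from a card."""
    return REQUIRED_FIELDS - card.keys()

if __name__ == "__main__":
    print("Missing MUST fields:", missing_must_fields(example_card) or "none")
```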
CAN • SHOULD • MUST
MUST
- Identity & lineage (machine-readable)
- Intended & out-of-scope uses (≥3 concrete unsafe uses + rationale)
- Performance claims → dataset version + eval script commit + run hash (subgroup metrics if relevant)
- Limitations & failure modes (worst-case behaviors tried; at least one “worse than baseline” context)
- Data provenance summary (classes/volumes/filters; link Data Cards if possible)
- Safety testing overview (red-team/evals scope: jailbreaks, autonomy, persuasion, cyber, bio)
- Operational constraints + changelog (rate/context limits, moderation, update policy)
SHOULD
- One “executable” eval (small, reproducible subset; see the sketch after this list)
- Map to NIST AI RMF or EU AI Act obligations
- Short System Card if shipped in a product (HITL, retention, deployment affordances)
- Bias/fairness subgroup rationale (why these; what’s missing)
CAN
- Third-party attestations (verify a slice of claims)
- Public card score (completeness/candor)
- Living card (auto-update + diff feed)
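To illustrate the MUST-level evidence links and the SHOULD-level executable eval, here is one way a performance claim could be bound to its evidence: hash the dataset version, eval-script commit, and raw outputs together so anyone re-running the subset can check they reproduce the same run hash. The helper name and claim fields are assumptions, not a fixed spec.

```python
# Sketch: pin a metric claim to its evidence via a deterministic hash.
import hashlib
import json

def run_hash(dataset_version: str, eval_commit: str, outputs: list[dict]) -> str:
    """Deterministic hash binding a performance claim to its evidence."""
    payload = json.dumps(
        {
            "dataset_version": dataset_version,
            "eval_script_commit": eval_commit,
            "outputs": outputs,
        },
        sort_keys=True,
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()

# A small, reproducible eval subset (the SHOULD-level "executable eval").
outputs = [
    {"id": "ex-001", "prediction": "Paris", "gold": "Paris"},
    {"id": "ex-002", "prediction": "4", "gold": "5"},
]
claim = {
    "metric": "exact_match",
    "value": sum(o["prediction"] == o["gold"] for o in outputs) / len(outputs),
    "dataset": "example-qa@v2.1",
    "eval_script_commit": "abc1234",
    "run_hash": run_hash("example-qa@v2.1", "abc1234", outputs),
}
print(json.dumps(claim, indent=2))
```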
Low-Friction Rationale
Schema-first, link artifacts over prose, reuse compliance docs you already produce.
Rough time: 4–10 hrs initial (10–20 if backfilling subgroup metrics); ~1 hr per update.
Doing this establishes norms that are not burdensome and makes compliance gaps more evident.
Adoption Pathways
Venues (MUST + one SHOULD), model hubs (completeness badge), procurement (map to EU/NIST), community norms (reward executable claims), standards (make a card a prerequisite for upload).
Next Steps (repo plan)
- v0.2: add JSON Schema and a minimal example card; CI to validate examples (sketched after this list)
- v0.3: sharpen interrogatory prompts; add HF-friendly README template
- v0.4: collect feedback, add third-party attestation hooks
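A sketch of what the v0.2 CI validation step could look like, assuming the repo ships a JSON Schema at schema/model_card.schema.json and example cards under examples/ (both paths are placeholders). It uses the jsonschema package, but any validator would do.

```python
# Sketch of a CI step: validate every example card against the schema
# and exit non-zero if any card fails, so the build breaks on violations.
import json
import pathlib
import sys

from jsonschema import Draft202012Validator

def validate_examples(schema_path: str, examples_dir: str) -> int:
    schema = json.loads(pathlib.Path(schema_path).read_text())
    validator = Draft202012Validator(schema)
    failures = 0
    for card_path in sorted(pathlib.Path(examples_dir).glob("*.json")):
        errors = list(validator.iter_errors(json.loads(card_path.read_text())))
        for err in errors:
            print(f"{card_path}: {err.message}")
        failures += bool(errors)
    return failures

if __name__ == "__main__":
    sys.exit(validate_examples("schema/model_card.schema.json", "examples"))
```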
Questions
What loopholes or blind spots do you see in this CAN/SHOULD/MUST split?
Am I being interrogatory enough?
How would you game this framework if it were a norm and you were adversarially capabilities-motivated?
You'd run into cognitive overhead limits. Manually reviewing other people's conversations can only really happen at 1:1 to 2:1 speeds. Summaries are much more efficient.
Plus, people behave very differently in radically observed environments. Panopticons were designed as part of a punishment system for a reason.
My project: DRI starter kit
DRI = Directly Responsible Individual. Often a / the Project Manager, but not always!
By 'graceful', do you mean morally graceful, technically graceful, or both / other?
Thanks for the write-up! I recall a conversation introducing me to all these ideas in Berkeley last year, and it's going to be very handy having a resource to point people at (and so I don't misremember details about things like the Yamanaka factors!).
Am I reading the current plan correctly such that the path is something like:
Get funding -> Continue R&D through primate trials -> Create an entity in a science-friendly, non-US state for human trials -> first rounds of Superbabies? That scenario seems like it would require a bunch of medical tourism, which is probably not off the table for people with the resources and mindset to participate in this.
I'm not sure that this mental line of defence would necessarily hold; we humans are easily manipulated by things that we know to be extremely simple agents that are definitely trying to manipulate us all the time: babies, puppies, kittens, etc.
This still holds true a significant amount of the time even if we pre-warn ourselves against the pending manipulation - there is a recurrent meme of, e.g., dads ostensibly not wanting a pet, only to relent when presented with one.
This implies your timelines for any large impact from AI would span multiple future generations, is that correct?
Dropping my plans of earning to give, which only really made sense before the recent flood of funding and the compression of timelines.
Increasing the amount of study I'm doing in Alignment and adjacent safety spaces. I have low confidence I'll be able to help in any meaningful fashion given my native abilities and timelines, but not trying seems both foolish and psychologically damaging.
Reconsidering my plans to have children - it's more likely I'll spend time and resources on children already existing (or planned) inside my circle of caring.
Meta: I have been burrowed away in other research but came across these notes and thought I would publish them rather than let them languish. If there are other efforts in this direction, I would be glad to be pointed that way so I can abandon this idea and support someone else's instead.