'I get surrounded by small ugh fields that grow into larger, overlapping ugh fields until my navigation becomes constrained and eventually impossible' was how I described one such experience.
A Sketched Proposal for Interrogatory, Low-Friction Model Cards
My auditor brain has been getting annoyed at the current state of model cards. If we adopt better norms proactively, this seems like low effort for a moderately good payoff? I am unsure about this, hence the rough draft below.
Problem
Model cards are uneven: selective disclosure, vague provenance, flattering metrics, generic limitations. Regulation (EU AI Act) and risk frameworks (NIST AI RMF) are pushing toward evidence-backed docs, but most “cards” are still self-reported. If we want to close gaps in evals and make safety claims more falsifiable, model card norms are an obvious lever.
Design Goal
A modern, interrogatory model card that:
- Minimizes authoring friction
- Maximizes falsifiability/comparability
- Fits EU/NIST governance
- Works for open/closed models
Lever: machine-readable schema + sharp prompts + links to evidence (metrics/datasets/scripts), not essays. Fine-tune the requirements in later versions.
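For concreteness, here is a minimal sketch of what the schema-first core could look like. Every field name here (model_id, out_of_scope_uses, performance_claims, and so on) is a placeholder invented for illustration; the real field set would be pinned down by the v0.2 JSON Schema.

```python
# Hypothetical MUST-level fields; names are placeholders, not a fixed spec.
REQUIRED_FIELDS = {
    "model_id",             # identity & lineage
    "lineage",
    "intended_uses",
    "out_of_scope_uses",    # >=3 concrete unsafe uses + rationale
    "performance_claims",   # each claim links dataset version, eval commit, run hash
    "limitations",
    "data_provenance",
    "safety_testing",
    "operational_constraints",
    "changelog",
}

example_card = {
    "model_id": "acme/example-7b",
    "lineage": {"base_model": "acme/example-7b-base", "finetune_of": None},
    "intended_uses": ["summarization of internal documents"],
    "out_of_scope_uses": [
        {"use": "medical diagnosis", "rationale": "no clinical validation"},
        {"use": "automated hiring decisions", "rationale": "subgroup metrics incomplete"},
        {"use": "autonomous code deployment", "rationale": "untested agentic failure modes"},
    ],
    "performance_claims": [
        {
            "metric": "exact_match",
            "value": 0.71,
            "dataset": "example-qa@v2.1",
            "eval_script_commit": "abc1234",
            "run_hash": "sha256:placeholder",
        }
    ],
    # Remaining MUST fields deliberately omitted to show the check below firing.
}

def missing_must_fields(card: dict) -> set[str]:
    """Return which MUST-level fields are absent from a card."""
    return REQUIRED_FIELDS - card.keys()

if __name__ == "__main__":
    print("Missing MUST fields:", missing_must_fields(example_card) or "none")
```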
CAN • SHOULD • MUST
MUST
- Identity & lineage (machine-readable)
- Intended & out-of-scope uses (≥3 concrete unsafe uses + rationale)
- Performance claims → dataset version + eval script commit + run hash (subgroup metrics if relevant)
- Limitations & failure modes (worst-case behaviors tried; at least one “worse than baseline” context)
- Data provenance summary (classes/volumes/filters; link Data Cards if possible)
- Safety testing overview (red-team/evals scope: jailbreaks, autonomy, persuasion, cyber, bio)
- Operational constraints + changelog (rate/context limits, moderation, update policy)
SHOULD
- One “executable” eval (small, reproducible subset; see the sketch after this list)
- Map to NIST AI RMF or EU AI Act obligations
- Short System Card if shipped in a product (HITL, retention, deployment affordances)
- Bias/fairness subgroup rationale (why these; what’s missing)
CAN
- Third-party attestations (verify a slice of claims)
- Public card score (completeness/candor)
- Living card (auto-update + diff feed)
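To illustrate the MUST-level evidence links and the SHOULD-level executable eval, here is one way a performance claim could be bound to its evidence: hash the dataset version, eval-script commit, and raw outputs together so anyone re-running the subset can check they reproduce the same run hash. The helper name and claim fields are assumptions, not a fixed spec.

```python
# Sketch: pin a metric claim to its evidence via a deterministic hash.
import hashlib
import json

def run_hash(dataset_version: str, eval_commit: str, outputs: list[dict]) -> str:
    """Deterministic hash binding a performance claim to its evidence."""
    payload = json.dumps(
        {
            "dataset_version": dataset_version,
            "eval_script_commit": eval_commit,
            "outputs": outputs,
        },
        sort_keys=True,
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()

# A small, reproducible eval subset (the SHOULD-level "executable eval").
outputs = [
    {"id": "ex-001", "prediction": "Paris", "gold": "Paris"},
    {"id": "ex-002", "prediction": "4", "gold": "5"},
]
claim = {
    "metric": "exact_match",
    "value": sum(o["prediction"] == o["gold"] for o in outputs) / len(outputs),
    "dataset": "example-qa@v2.1",
    "eval_script_commit": "abc1234",
    "run_hash": run_hash("example-qa@v2.1", "abc1234", outputs),
}
print(json.dumps(claim, indent=2))
```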
Low-Friction Rationale
Schema-first, link artifacts over prose, reuse compliance docs you already produce.
Rough time: 4–10 hrs initial (10–20 if backfilling subgroup metrics); ~1 hr per update.
Doing this establishes norms that are not burdensome and makes compliance gaps more evident.
Adoption Pathways
Venues (MUST + one SHOULD), model hubs (completeness badge), procurement (map to EU/NIST), community norms (reward executable claims), standards (make a card a prerequisite for upload).
Next Steps (repo plan)
- v0.2: add JSON Schema and a minimal example card; CI to validate examples (sketched after this list)
- v0.3: sharpen interrogatory prompts; add HF-friendly README template
- v0.4: collect feedback, add third-party attestation hooks
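A sketch of what the v0.2 CI validation step could look like, assuming the repo ships a JSON Schema at schema/model_card.schema.json and example cards under examples/ (both paths are placeholders). It uses the jsonschema package, but any validator would do.

```python
# Sketch of a CI step: validate every example card against the schema
# and exit non-zero if any card fails, so the build breaks on violations.
import json
import pathlib
import sys

from jsonschema import Draft202012Validator

def validate_examples(schema_path: str, examples_dir: str) -> int:
    schema = json.loads(pathlib.Path(schema_path).read_text())
    validator = Draft202012Validator(schema)
    failures = 0
    for card_path in sorted(pathlib.Path(examples_dir).glob("*.json")):
        errors = list(validator.iter_errors(json.loads(card_path.read_text())))
        for err in errors:
            print(f"{card_path}: {err.message}")
        failures += bool(errors)
    return failures

if __name__ == "__main__":
    sys.exit(validate_examples("schema/model_card.schema.json", "examples"))
```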
Questions
What loopholes or blind spots do you see in this CAN/SHOULD/MUST split?
Am I being interrogatory enough?
How would you game this framework if it were a norm and you were adversarially capabilities-motivated?
You'd run into cognitive overhead limits. Manually reviewing other people's conversations can only really happen at 1:1 to 2:1 speeds. Summaries are much more efficient.
Plus, people behave very differently in radically observed environments. Panopticons were designed as part of a punishment system for a reason.
My project: DRI starter kit
DRI = Directly Responsible Individual. Often a / the Project Manager, but not always!
By 'graceful', do you mean morally graceful, technically graceful, or both / other?
Thanks for the write-up! I recall a conversation introducing me to all these ideas in Berkeley last year, and it's going to be very handy having a resource to point people at (and so I don't misremember details about things like the Yamanaka factors!).
Am I reading the current plan correctly such that the path is something like:
Get funding -> Continue R&D through primate trials -> Create an entity in a science-friendly, non-US state for human trials -> first rounds of Superbabies? That scenario seems like it would require a bunch of medical tourism, which is probably not off the table for people with the resources and mindset to participate in this.
I'm not sure that this mental line of defence would necessarily hold; we humans are easily manipulated by things that we know to be extremely simple agents that are definitely trying to manipulate us all the time: babies, puppies, kittens, etc.
This still holds true a significant amount of the time even if we pre-warn ourselves against the pending manipulation - there is a recurrent meme of, e.g., dads ostensibly not wanting a pet, only to relent when presented with one.
This implies your timelines for any large impact from AI would span multiple future generations, is that correct?
Dropping my plans of earning to give, which only really made sense before the recent flood of funding and the compression of timelines.
Increasing the amount of study I'm doing in Alignment and adjacent safety spaces. I have low confidence I'll be able to help in any meaningful fashion given my native abilities and timelines, but not trying seems both foolish and psychologically damaging.
Reconsidering my plans to have children - it's more likely I'll spend time and resources on children already existing (or planned) inside my circle of caring.
Meta: I have been burrowed away in other research but came across these notes and thought I would publish them rather than let them languish. If there are other efforts in this direction, I would be glad to be pointed that way so I can abandon this idea and support someone else's instead.