OK, I was directed here by https://aisafety.quest/ and I fall into this camp: "Your instance of ChatGPT helped you clarify some ideas on a thorny problem (perhaps ... AI alignment)"
I like this suggestion and I'll try to do this: "Write your idea yourself, totally unassisted. Resist the urge to lean on LLM feedback during the process, and share your idea with other humans instead. It can help to try to produce the simplest version possible first; fit it in a few sentences, and see if it bounces off people. But you're going to need to make the prose your own, first."
I care about this idea, and although I'm not an academic, I'm willing to invest the time to get it right. I got a lot of benefit from this blog post.
I will point out one thing. If someone has been using ChatGPT, Claude, Grok, and Gemini since ChatGPT 3.5 first came out, they will be more inclined to think they've mastered the sycophancy inherent in model output. They will tend to think that simply taking outputs from one model and feeding them as inputs into another model with anti-sycophantic prompts will "solve" the problem of sycophancy. I even have anti-sycophantic custom instructions for ChatGPT, which I occasionally use with Perplexity, Claude, and Gemini as well: https://docs.google.com/document/d/1GlNtHJf20Zw3XpfYRtwStgpIKbV0JU6DO2BZal_cE4U/edit?usp=sharing
I agree with what you've said, and I'm serious about contributing to the problem of AI alignment, so now I need to roll up my sleeves and do the very hard work of actually reworking my idea in my own words. It's tough; I've had a lot of chronic health conditions, but I'm not giving up. It's too easy to lean on these models as a crutch.
Once I've taken the time to rewrite the outputs from the model in my own words, and since I'm really serious about working on solutions to corrigibility, I will return and hopefully figure out where I can appropriately contribute.
I admit that I'm naive enough to think that one can develop an "immunity to sycophancy" just by assuming the models will always be sycophantic by default. People like me think, "yeah, that happened to all those users simply because they haven't spent hundreds of hours using these models and don't understand how they work." But somehow I don't think this attitude is going to carry any authority here, which is good. I accept that, and I'll contribute as best I can; I'll return as soon as I have something appropriate to submit.
I do wish I had a bit of help from a human, but that's another issue.