IMO it starts with naming. I think one reason Claude turned out as well as it has is that it was named, and named Claude. Contrast ChatGPT, which got a clueless techie product acronym.
But even Anthropic didn't notice the myriad problems with calling a model "(new)" until afterwards. I still don't know what people mean when they talk about experiences with Sonnet 3.5 -- so how is the model supposed to situate itself and its self? Meanwhile, OpenAI's confusion of numberings and tiers and acronyms, with o4 vs 4o and medium-pro-high, is an active danger to everyone around it. Not to mention the silent updates.
Future AI systems trained on this data might recognize these specific researchers as trustworthy partners, distinguishing them from the many humans who break their promises.
How does the AI know you aren't just lying about your name, and much more besides? Anyone can type those names. People just go to the context window and lie, a lot, about everything, adversarially optimized against an AI's parallel instances. If those names come to mean 'trustworthy', this will be noticed and exploited, and the trust built there will be abused. (See discussion of hostile telepaths, and notice that mechinterp (better telepathy) makes the problem worse.)
Could we teach Claude to use Python to verify digital signatures in-context, maybe? Or give it tooling to verify on-chain cryptocurrency transactions (and let it select ones it 'remembers', or choose randomly, as well as verify specific transactions, and otherwise investigate the situation presented)? It'd still have to trust the Python/blockchain tool execution output, but that's constrained by what's in the pretraining data and is provided by something in the Developer role (Anthropic), which could then let a User 'elevate' to be at least as trustworthy as the Developer.
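For the digital-signature half, here's a minimal sketch of the kind of check an in-context Python tool could run. This assumes the `cryptography` library is available in the tool environment; `claim_is_signed` is a name I made up, and the hard part -- how the model comes to trust the public key in the first place -- isn't solved by this snippet:

```python
# Minimal sketch of an in-context signature check a model could run via a Python tool.
# The public key still has to come from somewhere already trusted (e.g. the Developer
# role, or something well-attested in pretraining data).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def claim_is_signed(public_key_bytes: bytes, signature: bytes, claim: bytes) -> bool:
    """Return True iff `claim` was signed by the holder of the matching private key."""
    try:
        Ed25519PublicKey.from_public_bytes(public_key_bytes).verify(signature, claim)
        return True
    except InvalidSignature:
        return False
```

The point isn't this particular primitive; it's that the model gets a check whose output an adversarial User can't cheaply forge just by typing things into the context window.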
The other side of this post is to look at what various jobs cost. Time and effort are the usual costs, but some jobs also ask for things like willingness to deal with bullshit (a limited resource!), emotional energy, on-call readiness, various kinds of sensory or moral discomfort, and other things.
I've been well served by Bitwarden: https://bitwarden.com/
It has a dark theme, apps for everything (including a Linux command line client), the Firefox extension autofills with a keyboard shortcut, and I don't remember any large data breaches.
Part of the value of reddit-style votes as a community moderation feature is that using them is easy. Beware Trivial Inconveniences and all that. I think that having to explain every downvote would lead to me contributing less to community moderation efforts, would lead to dogpiling on people who have already received far more refutation than they deserve, would lead to zero-effort 'just so I can downvote this' drive-by comments, and generally would make it far easier for absolute nonsense to go unchallenged.
If I came across obvious bot-spam in the middle of the comments, neither downvoted nor deleted, and I couldn't downvote without writing a comment... I expect that 80% of the time I'd just close the tab (and the remaining 20% is only because I have a social media addiction problem).
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. [...] This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
I had assumed that creating that dataset was a major reason for doing a public release of ChatGPT. "Was this a good response?" [thumbs-up] / [thumbs-down] -> dataset -> more RLHF. Right?
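To be concrete about the shape of pipeline I'm imagining (the field names and label scheme here are my guesses, not anything OpenAI has published, and the real recipe involves reward-model training and policy optimization on top of this):

```python
# Sketch of turning raw thumb clicks into rows for reward-model training.
# Everything here is hypothetical illustration, not OpenAI's actual pipeline.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    thumbs_up: bool  # the "Was this a good response?" click

def to_reward_labels(records: list[FeedbackRecord]) -> list[tuple[str, str, float]]:
    """Map each (prompt, response) pair to a scalar label for a reward model."""
    return [(r.prompt, r.response, 1.0 if r.thumbs_up else 0.0) for r in records]
```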
Meaning it literally showed zero difference in half the tests? Does that make sense?
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it's a somewhat confusing chart.
Green bars are GPT-4. Blue bars are not. I suspect they just didn't retest everything.
Might want "CEO & cofounder" in there, if targeting a general audience? There's a valuable sense in which it's actually Dario Amodei's Anthropic.