This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object-level question of how (and whether!) each clever scheme is doomed, then we are guaranteed not to find one.
Security mindset means looking for flaws, not assuming all plans are so doomed that you don't need to look.
If this is, in fact, a utility function which, if followed, would lead to a good future, that is concrete progress and lays out a new set of true names as a win condition. Not a solution, since we can't train AIs with arbitrary goals, but it's progress in the same way that quantilizers were progress on mild optimization.
Not inspired by them, no. Those did not have, as far as I'm aware, a clear outlet for use of the outputs. We have a whole platform we've been building towards for three years (starting on the FAQ long before those contests), and the ability to point large numbers of people at that platform once it has great content thanks to Rob Miles.
As I said over on your Discord, this feels like it has a shard of hope, and is the kind of thing that could plausibly work if we could hand AIs utility functions.
I'd be interested to see the explicit breakdown of the true names you need for this proposal.
Agreed, incentives probably block this from being picked up by megacorps. I had thought to try to get Musk's Twitter to adopt it at one point when he was talking about bots a lot. It would be very effective, but it doesn't allow rent extraction in the same way as the solution he settled on (paid Twitter Blue).
Websites which have the slack to let users improve their experience even if it costs engagement might be better adopters. LessWrong has shown they will do this, e.g. by batching karma daily by default to avoid dopamine addiction.
Hypothesis #2: These bits of history are wrong for reasons you can check with simpler learned structures.
Maybe these historical patterns are easier to disprove with simple exclusions, like "these things were in different places"?
And if you use common but obviously wrong science or maths, it is less likely to hallucinate.
Yeah, my guess is that if you use really niche and plausible-sounding historical examples, it is much more likely to hallucinate.
Maybe the RLHF agent that was selected for expects the person giving feedback to correct it on the history example, but not to know that the latter example is false. If you asked a large sample of humans, more would be able to confidently say the first example is false than the latter one.
Yeah, that makes a lot of sense and fits my experience of what works.
Meta's previous LLM, OPT-175B, seemed good by benchmarks but was widely agreed to be much, much worse than GPT-3 (not even necessarily better than GPT-NeoX-20B). It's an informed guess, not a random dunk, and does leave open the possibility that they've turned it around and have a great model this time rather than something which goodharts the benchmarks.