Engineer at EMX.com. Donor to LW 2.0.
my enthusiasm for the project leads me to be a bit coy about revealing too much detail on the internet
FWIW, it may be worth keeping in mind the Silicon Valley maxim that ideas are cheap, and execution is what matters. In most cases you're far more likely to make progress on the idea if you get it out into the open, especially if execution at all depends on having collaborators or other supporters. (Also helpful to get feedback on the idea.) The probability that someone else successfully executes on an idea that you came up with is low.
Ah! It's much clearer to me now what you're looking for.
Two things that come to mind as vaguely similar:
1) The habit of some rationalist bloggers of flagging claims with "epistemic status". (E.g. here or here)
2) Wikipedia's guidelines for verifiability (and various other guidelines that they have)
Of course, neither is exactly what you're talking about, but perhaps they could serve as inspiration.
I suppose there is a continuum of how much insight the human has into what the agent is doing. Squeezing all your evaluation into one simple reward function sits at one end of the spectrum (and is particularly susceptible to unintended behaviors); watching a 2d projection of a 3d action is further along (but not all the way to full insight); and you can imagine setups with much more insight than that.
We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples).
Nitpick: this description seems potentially misleading to me (at least it gave me the wrong impression at first!). When I read it, it sounds like it's saying that a human looked at what the AI did and thought it was good, before they (or someone else) dug deeper.
But the examples (the ones I spot-checked anyway) all seem to be cases where the systems satisfied some predefined goal in an unexpected way, but if a human had looked at what the agent did (and not just the final score), it would have been obvious that the agent wasn't doing what was expected / wanted. (E.g. "A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game." or "A robotic arm trained using hindsight experience replay to slide a block to a target position on a table achieves the goal by moving the table itself." A human evaluator wouldn't have looked at these behaviors and said, "Looks good!")
I call this out just because the world where ML systems are actually routinely managing to fool humans into thinking they're doing something useful when they're not is much scarier than one where they're just gaming pre-specified reward functions. And I think it's important to distinguish the two.
EDIT: this example does seem to fit the bill though:
One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing.
My original response was a more abstract explanation of why I think describing money as "not real" is misleading, but maybe a more direct response to the article would be more useful, since I think that part actually isn't core to your question.
Just to clarify, I'm not the OP. It just seemed to me like you and the OP were saying something similar.
describing it as a way of temporarily motivating strangers seems like a misunderstanding of what money is
The innovation of money itself is that we can get favors repaid by complete strangers without having to figure out the chain of favors every time
What's the difference? It sounds like the two of you are saying the same thing. Except just that you don't like using the term "illusion" to describe it?
I think that means I'm [...] bad at describing/ searching for what I'm looking for.
One thing that might help, in terms of understanding what you're looking for, is -- how do you expect to be able to use this "model of ranking"?
It's not quite clear to me whether you're looking for something like an algorithm (where somebody could code it up as a computer program, you could feed in sentences, and it would spit out scores), something more like a framework or rubric (where the work of understanding and evaluating sentences would still be done by people, using the framework/rubric as a guide to decide how to rate the sentences), or something else.
I can't say that I'm familiar with the morass that you speak of. I work in clinical medicine and tend to just have a 10,000-foot view of philosophy. Can you maybe elaborate on what you see the problem as?
You might want to take a look at the A Human's Guide to Words sequence. (Or, for a summary, see just the last post in that sequence: 37 Ways That Words Can Be Wrong.)
I have come to the conclusion that this is just a bandaid on a more fundamental problem. Whether we should choose total, average or even median utility isn’t something we could objectively decide. So I suggest that we go up one level, and maximize what most people want to maximize.
If you haven't seen it, you may find this paper interesting: Geometric reasons for normalising variance to aggregate preferences, by Owen Cotton-Barratt (as an example of another potentially elegant approach to aggregating preferences).
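To give a feel for what variance normalisation does, here's a toy sketch (my own illustration; the setup, numbers, and function names are not from the paper, and it assumes every agent has some variance in their utilities): each person's utilities over the options are shifted to mean zero and scaled to unit variance before summing, so nobody's raw scale dominates the aggregate.

```python
import statistics

def normalise(utilities):
    """Shift utilities to mean 0 and scale to unit (population) variance.
    Assumes the utilities are not all equal, so the std is nonzero."""
    mean = statistics.fmean(utilities)
    std = statistics.pstdev(utilities)
    return [(u - mean) / std for u in utilities]

def aggregate(profiles):
    """Sum each option's normalised utility across all agents."""
    normalised = [normalise(p) for p in profiles]
    return [sum(agent[i] for agent in normalised)
            for i in range(len(profiles[0]))]

# Two agents scoring three options on wildly different scales:
alice = [0, 5, 10]       # mild preferences
bob = [0, 1000, -1000]   # extreme preferences
scores = aggregate([alice, bob])
best = max(range(3), key=lambda i: scores[i])
```

Without normalisation, bob's huge numbers would simply decide the outcome; after normalisation, option 1 (alice's middle choice, bob's favourite) wins, because both agents' preferences are weighted on a comparable scale.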
Relatedly, I'm going back and forth in my head a bit about whether it's better to explain category theory in graph theory terms by identifying the morphisms with edges or with paths.
Morphisms = Edges
Morphisms = Paths
I think this final point in the morphisms = paths formulation might be what tripped you up in the case Eigil points out above, where category theory treats two arrows from A to B that are equivalent to each other as actually the same arrow. This seems to be the one place (from what I can see so far) where the paths formulation gives the wrong intuition.
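The "morphisms = paths" picture can be made concrete as the free category on a directed graph: objects are vertices, morphisms are paths, composition is concatenation, and identities are empty paths. Here's a toy sketch (my own illustration; the graph and names are made up), which also shows exactly where the intuition breaks: in the free category, two parallel edges give two genuinely distinct morphisms, whereas a category imposing the equation f = g would identify them.

```python
# A graph with two parallel edges f, g : A -> B and one edge h : B -> C.
# Each edge is (name, source, target).
edges = [("f", "A", "B"), ("g", "A", "B"), ("h", "B", "C")]

def identity(obj):
    """Identity morphism at obj: the empty path based at obj."""
    return (obj, ())

def path(*names):
    """Build a morphism of the free category from a list of edge names."""
    es = [e for name in names for e in edges if e[0] == name]
    for e1, e2 in zip(es, es[1:]):
        assert e1[2] == e2[1], "consecutive edges don't line up"
    return (es[0][1], tuple(es))

def compose(p, q):
    """Compose p : X -> Y with q : Y -> Z by concatenating edge lists."""
    (src, es_p), (_, es_q) = p, q
    if es_p and es_q:
        assert es_p[-1][2] == es_q[0][1], "paths don't line up"
    return (src, es_p + es_q)

hf = compose(path("f"), path("h"))  # one morphism A -> C
hg = compose(path("g"), path("h"))  # a *different* morphism A -> C
```

In the free category `hf != hg`, and the identity laws hold by construction (concatenating an empty path changes nothing). If the category you care about has the equation f = g, the paths formulation over-counts, which is the failure mode described above.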