Engineer at Donor to LW 2.0.

ESRogs's Comments

Hello, is it you I'm looking for?
my enthusiasm for the project leads me to be a bit coy about revealing too much detail on the internet

FWIW, it may be worth keeping in mind the Silicon Valley maxim that ideas are cheap, and execution is what matters. In most cases you're far more likely to make progress on the idea if you get it out into the open, especially if execution at all depends on having collaborators or other supporters. (Also helpful to get feedback on the idea.) The probability that someone else successfully executes on an idea that you came up with is low.

Hello, is it you I'm looking for?

Ah! It's much clearer to me now what you're looking for.

Two things that come to mind as vaguely similar:

1) The habit of some rationalist bloggers of flagging claims with "epistemic status". (E.g. here or here)

2) Wikipedia's guidelines for verifiability (and various other guidelines that they have)

Of course, neither is exactly what you're talking about, but perhaps they could serve as inspiration.

Writeup: Progress on AI Safety via Debate

I suppose there is a continuum of how much insight the human has into what the agent is doing. Squeezing all your evaluation into one simple reward function would be on one end of the spectrum (and particularly susceptible to unintended behaviors), and then watching a 2d projection of a 3d action would be further along the spectrum (but not all the way to full insight), and then you can imagine setups with much more insight than that.

Writeup: Progress on AI Safety via Debate
We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples).

Nitpick: this description seems potentially misleading to me (at least it gave me the wrong impression at first!). When I read this, it sounds like it's saying that a human looked at what the AI did and thought it was good, before they (or someone else) dug deeper.

But the examples (the ones I spot-checked anyway) all seem to be cases where the systems satisfied some predefined goal in an unexpected way, but if a human had looked at what the agent did (and not just the final score), it would have been obvious that the agent wasn't doing what was expected / wanted. (E.g. "A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game." or "A robotic arm trained using hindsight experience replay to slide a block to a target position on a table achieves the goal by moving the table itself." A human evaluator wouldn't have looked at these behaviors and said, "Looks good!")

I call this out just because the world where ML systems are actually routinely managing to fool humans into thinking they're doing something useful when they're not is much scarier than one where they're just gaming pre-specified reward functions. And I think it's important to distinguish the two.

EDIT: this example does seem to fit the bill though:

One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing.
Money isn't real. When you donate money to a charity, how does it actually help?
My original response was a more abstract explanation of why I think describing money as "not real" is misleading, but maybe a more direct response to the article would be more useful, since I think that part actually isn't core to your question.

Just to clarify, I'm not the OP. It just seemed to me like you and the OP were saying something similar.

Money isn't real. When you donate money to a charity, how does it actually help?
describing it as a way of temporarily motivating strangers seems like a misunderstanding of what money is

Compare to:

The innovation of money itself is that we can get favors repaid by complete strangers without having to figure out the chain of favors every time

What's the difference? It sounds like the two of you are saying the same thing, except that you don't like using the term "illusion" to describe it?

Hello, is it you I'm looking for?
I think that means I'm [...] bad at describing/ searching for what I'm looking for.

One thing that might help, in terms of understanding what you're looking for, is -- how do you expect to be able to use this "model of ranking"?

It's not quite clear to me whether you're looking for something like an algorithm (where somebody could code it up as a computer program, and you could feed in sentences and it would spit out scores), something more like a framework or rubric (where the work of understanding and evaluating sentences would still be done by people, but they could use the framework/rubric as a guide to decide how to rate the sentences), or something else.

Hello, is it you I'm looking for?
I can't say that I'm familiar with the morass that you speak of. I work in clinical medicine and tend to just have a 10,000 mile view on philosophy. Can you maybe elaborate on what you see the problem as?

You might want to take a look at the A Human's Guide to Words sequence. (Or, for a summary, see just the last post in that sequence: 37 Ways That Words Can Be Wrong.)

Meta-Preference Utilitarianism
I have come to the conclusion that this is just a bandaid on a more fundamental problem. Whether we should choose total, average or even median utility isn’t something we could objectively decide. So I suggest that we go up one level, and maximize what most people want to maximize.

If you haven't seen it, you may find this paper interesting: Geometric reasons for normalising variance to aggregate preferences, by Owen Cotton-Barratt (as an example of another potentially elegant approach to aggregating preferences).

Category Theory Without The Baggage

Relatedly, I'm going back and forth in my head a bit about whether it's better to explain category theory in graph theory terms by identifying the morphisms with edges or with paths.

Morphisms = Edges

  • In this version, a subset of multidigraphs (apparently also called quivers!) can be thought of as categories -- those for which every vertex has an edge to itself, and for which whenever there's a path from A to B, there's also an edge directly from A to B.
  • You also have to say:
    • for each pair of edges from A to B and B to C, which edge from A to C corresponds to their composition
    • for each node, which (of possibly multiple) edge to itself is its default or identity edge
    • in such a way that the associative and unital laws hold
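
To make the bookkeeping in this version concrete, here is a minimal sketch of a checker for a small finite multidigraph. The names (`edges`, `identities`, `compose`, `is_category`) and the two-node example are all made up for illustration; the convention that `compose[(g, h)]` means "follow g, then h" is also just an assumption of the sketch.

```python
from itertools import product

# Hypothetical example: two nodes, one non-identity edge f: A -> B.
# Each edge name maps to its (source, target) pair.
edges = {
    "id_A": ("A", "A"),
    "id_B": ("B", "B"),
    "f": ("A", "B"),
}
# For each node, which self-edge is its chosen identity edge.
identities = {"A": "id_A", "B": "id_B"}
# compose[(g, h)] is the edge meaning "follow g, then h".
compose = {
    ("id_A", "id_A"): "id_A",
    ("id_B", "id_B"): "id_B",
    ("id_A", "f"): "f",
    ("f", "id_B"): "f",
}


def is_category(edges, identities, compose) -> bool:
    """Check the 'morphisms = edges' conditions on a finite multidigraph."""
    nodes = {n for endpoints in edges.values() for n in endpoints}

    # Every node needs a chosen identity edge that is a self-loop.
    for n in nodes:
        i = identities.get(n)
        if i is None or edges.get(i) != (n, n):
            return False

    # Every composable pair needs a composite edge with the right endpoints.
    for g, h in product(edges, repeat=2):
        if edges[g][1] == edges[h][0]:
            gh = compose.get((g, h))
            if gh is None or edges[gh] != (edges[g][0], edges[h][1]):
                return False

    # Unital law: composing with an identity edge changes nothing.
    for e, (src, dst) in edges.items():
        if compose[(identities[src], e)] != e:
            return False
        if compose[(e, identities[dst])] != e:
            return False

    # Associative law: (g then h) then k equals g then (h then k).
    for g, h, k in product(edges, repeat=3):
        if edges[g][1] == edges[h][0] and edges[h][1] == edges[k][0]:
            if compose[(compose[(g, h)], k)] != compose[(g, compose[(h, k)])]:
                return False

    return True
```

The point of the sketch is just that everything (identities, composites, and the laws) has to be supplied and checked by hand in this formulation.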

Morphisms = Paths

  • In this version, any multidigraph (quiver) can be thought of as a category.
  • You get the identities for free, because they're just the trivial, do-nothing, paths.
  • You get composition for free, because we already know what it means that following a path from A to M and then from M to Z is itself a path.
  • And you get the associative and unital laws for (almost) free:
    • unital: doing nothing at the start or end of a path obviously doesn't change the path
    • associative: it's natural to think of the paths ((e1, e2), e3) and (e1, (e2, e3)) -- where e1, e2, and e3 are edges -- as both being the same path [e1, e2, e3]
  • However, you now have to add on one weird extra rule that two paths that start and end at the same place can be considered the same path, even if they didn't go through the same intermediate nodes.
    • In other words, the intuition that composing-pairs-of-paths-in-different-orders-always-gives-you-the-same-final-path gives you a sufficient, but not necessary, condition for two paths being considered equivalent.
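
As a rough sketch of how much comes for free in this version, paths can be modelled as tuples of edges, with composition as concatenation and the empty tuple as the identity. The names `Edge`, `Path`, `identity`, `from_edge`, and `compose` are hypothetical, and the sketch deliberately leaves out the one weird extra rule (identifying some parallel paths), so it only models the paths-as-morphisms part.

```python
from typing import NamedTuple, Tuple


class Edge(NamedTuple):
    name: str
    src: str
    dst: str


class Path(NamedTuple):
    src: str
    dst: str
    edges: Tuple[Edge, ...]


def identity(node: str) -> Path:
    """The trivial, do-nothing path at a node."""
    return Path(node, node, ())


def from_edge(e: Edge) -> Path:
    """A single edge viewed as a path of length one."""
    return Path(e.src, e.dst, (e,))


def compose(p: Path, q: Path) -> Path:
    """Follow p, then q. Only defined when p ends where q starts."""
    if p.dst != q.src:
        raise ValueError(f"cannot compose: {p.dst!r} != {q.src!r}")
    return Path(p.src, q.dst, p.edges + q.edges)


# Hypothetical quiver with edges A -> B -> C -> D.
e1 = Edge("e1", "A", "B")
e2 = Edge("e2", "B", "C")
e3 = Edge("e3", "C", "D")
p1, p2, p3 = from_edge(e1), from_edge(e2), from_edge(e3)

# Associative: both groupings give the same path [e1, e2, e3].
left = compose(compose(p1, p2), p3)
right = compose(p1, compose(p2, p3))
assert left == right

# Unital: doing nothing at the start or end doesn't change the path.
assert compose(identity("A"), p1) == p1
assert compose(p1, identity("B")) == p1
```

Both laws fall out of tuple concatenation here, without any explicit composition table. Adding the extra rule would mean layering an equivalence relation on top of `Path`, which this sketch does not attempt.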

I think this final point in the morphisms = paths formulation might be what tripped you up in the case Eigil points out above, where category theory treats two arrows from A to B that are equivalent to each other as actually the same arrow. This seems to be the one place (from what I can see so far) where the paths formulation gives the wrong intuition.
