Neural network interpretability feels like it should be called neural network interpretation.

If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
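A minimal sketch of what a context-dependent threshold could look like, assuming the posterior is available as a finite list of weighted hypotheses. The names `harm_mass` and `should_attend`, and all the numbers, are made up for illustration and not taken from any actual proposal:

```python
from typing import Sequence


def harm_mass(weights: Sequence[float], predicts_harm: Sequence[bool]) -> float:
    """Posterior mass on the hypotheses that predict harm (weights are normalised here)."""
    total = sum(weights)
    return sum(w for w, h in zip(weights, predicts_harm) if h) / total


def should_attend(weights: Sequence[float], predicts_harm: Sequence[bool],
                  threshold: float) -> bool:
    """Pay attention once the harm-predicting mass reaches the chosen threshold."""
    return harm_mass(weights, predicts_harm) >= threshold


# Hypothetical posterior over three hypotheses; only the second predicts harm.
weights = [0.90, 0.08, 0.02]
predicts_harm = [False, True, False]

# Same posterior, different contexts: a strict threshold for high-stakes
# decisions, a looser one for low-stakes decisions.
high_stakes = should_attend(weights, predicts_harm, threshold=0.05)
low_stakes = should_attend(weights, predicts_harm, threshold=0.20)
```

With a plain ensemble you only get a handful of equally weighted votes, so there's much less room to tune a threshold like this.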

I think the reason Bayesian ML isn't that widely used is that it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.

I agree this agenda wants to solve ELK by making legible predictive models.

But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.

It's not just a lesswrong thing (wikipedia).

My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:

  1. the state or quality of being dedicated to a cause, activity, etc.
  2. an engagement or obligation that restricts freedom of action

Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so the "pre-" seems pointless there. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.

IIUC, I think that in addition to making predictive models more human interpretable, there's another way this agenda aspires to get around the ELK problem.

Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it also wants to learn a posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.

IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the "right" notion of what constitutes a bad outcome, rather than a wrong one like "a human's best guess would be that this outcome is bad". I think Bengio's proposal aspires to learn a range of plausible auxiliary models, and to cover that space well enough that both the "right" notion and the wrong one are in there, and then if any of those models predicts a bad outcome, call it a bad plan.
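As a toy sketch of the "if any of those models predicts a bad outcome, call it a bad plan" rule: the two reporters and the plan fields below are hypothetical stand-ins I've invented for illustration, not anything from the ELK report or Bengio's proposal.

```python
from typing import Callable, Dict, Iterable

Plan = Dict[str, bool]
HarmModel = Callable[[Plan], bool]  # returns True if the model predicts a bad outcome


def plan_is_bad(plan: Plan, harm_models: Iterable[HarmModel]) -> bool:
    """Conservative rule: reject the plan if any plausible harm model flags it."""
    return any(model(plan) for model in harm_models)


# Hypothetical reporters. One tracks what a human observer would conclude,
# the other tracks what is actually the case.
def human_best_guess(plan: Plan) -> bool:
    return not plan["looks_fine_to_human"]


def direct_report(plan: Plan) -> bool:
    return not plan["actually_fine"]


# A plan that fools the human-simulating reporter but not the direct one.
deceptive_plan = {"looks_fine_to_human": True, "actually_fine": False}
benign_plan = {"looks_fine_to_human": True, "actually_fine": True}

deceptive_flagged = plan_is_bad(deceptive_plan, [human_best_guess, direct_report])
benign_flagged = plan_is_bad(benign_plan, [human_best_guess, direct_report])
```

The rule only helps if the "right" reporter really is somewhere in the set, which is why covering the space of plausible models matters so much here.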

EDIT: from a quick look at the ELK report, this idea ("ensembling") is mentioned under "How we'd approach ELK in practice". Doing ensembles well seems like sort of the whole point of the AI scientists idea, so it's plausible to me that this agenda could make progress on ELK even if it wasn't specifically thinking about the problem.

"Bear in mind he could be wrong" works well for telling somebody else to track a hypothesis.

"I'm bearing in mind he could be wrong" is slightly clunkier but works ok.

Really great post. The pictures are all broken for me, though.

He means the second one.

Seems true in the extreme (if you have zero idea what something is, how can you reasonably be worried about it?), but less strange the further you get from that.

Somewhat related: how do we not have separate words for these two meanings of 'maximise'?

  1. literally set something to its maximum value
  2. try to set it to a big value, the bigger the better

Even what I've written for (2) doesn't feel like it unambiguously captures the generally understood meaning of 'maximise' in common phrases like 'RL algorithms maximise reward' or 'I'm trying to maximise my income'. I think the really precise version would be 'try to affect something, having a preference ordering over outcomes which is monotonic in their size'.

But surely this concept deserves a single word. Does anyone know a good word for this, or feel like coining one?
