Ronny Fernandez — LessWrong

The Most Common Bad Argument In These Parts

Curated. This does indeed seem like a common kind of bad argument around these parts, and one that has not yet been named. I also appreciate Rohin's comment pointing out that it's not obvious what makes this kind of reasoning bad, as well as David Manheim's comment saying that what is needed is a way to distinguish cases where bounded search works well from cases where it works poorly. More generally, I like seeing content posted that evaluates a common kind of reasoning, especially the sort that inspires interesting engagement and/or disagreement in the replies. I would be excited to see more case studies of when this sort of reasoning works well or poorly, and maybe even a general theory to help us decide when it tends to work out well, e.g., when implemented by superforecasters on many topics.

Alignment remains a hard, unsolved problem

Ronny Fernandez · 16d

Curated. I have wanted someone to write out an assessment of how the Risks from Learned Optimization arguments hold up in light of the evidence we have acquired over the last half decade. I particularly appreciated breaking down the potential reasons for risk and assessing to what degree we have encountered each problem, as well as reassessing the chances of running into those problems. I would love to see more posts that take arguments/models/concepts from before 2020, consider what predictions we should have made pre-2020 if these arguments/models/concepts were good, and then reassess them in light of our observations of progress in ML over the last five years.

Legible vs. Illegible AI Safety Problems

Ronny Fernandez · 1mo

Curated. This is a simple and obvious argument, with important implications, that I had never heard before. I have heard similar considerations in conversations about whether someone should take a job at a capabilities lab, or whether some particular safety technique is worth working on, but it's valuable to generalize across those cases and have a central place for discussing the generalized argument.

I would love to see more pushback in the comments from those who are currently working on legible safety problems.

How Does A Blind Model See The Earth?

Ronny Fernandez · 4mo

Is this coming just from the models having geographic data in their training? Much less impressive if so, but still cool.

A case for courage, when speaking of AI danger

Ronny Fernandez · 5mo

To check, do you have particular people in mind for this hypothesis? It seems kinda rude to name them here, but could you maybe send me some guesses privately? I currently don't find this hypothesis, as stated, very plausible. Or, sure, maybe it's true, but I think it accounts for a relatively small fraction of the effect.

A case for courage, when speaking of AI danger

Ronny Fernandez · 5mo

Curated. I have at times been more cautious in communicating my object-level views than I now wish I had been. I appreciate this post as a flag for courage: something others might see, and which might counter some of the (according to me) prevailing messages of caution. Those messages, at least in my case, significantly contributed to my caution, and I wish there had been something like this post around for me to read before I had to decide how cautious to be.

The argument this post presents for the conclusion that many people should be braver in communicating about AI x-risk by their own lights is only moderately convincing. It relies heavily on how the book’s blurbs were sampled, and it seems likely that a lot of optimization went into getting strong snippets rather than a representative sample. I find it hard to update much on this without knowing the details of how the blurbs were collected, even though you address this specific concern. Still, it's not totally unconvincing.

I’d like to see more empirical research into what kinds of rhetoric work to achieve which aims when communicating with different audiences about AI x-risk. This seems like the sort of thing humanity has already specced into studying, and I’d love to see more writeups applying that existing competence to these questions.

I also would have liked to see more of Nate’s personal story: how he came to hold his current views. My impression is that he didn’t always believe so confidently that people should hold the courage of their convictions more when talking about AI x-risk. A record of how his mind changed over time, and what observations/thoughts/events caused that change, could be informative for others in an importantly different way from how this post or empirical work on the question might be. I’d love to see a post from Nate on that in the future.

METR: Measuring AI Ability to Complete Long Tasks

Ronny Fernandez · 9mo

Curated. Comparing model performance on tasks to the time human experts need to complete the same tasks (at a fixed reliability) is worth highlighting, since it helps operationalize terms like "human-level-AI" and "AI-level-of-capabilities" in general. Furthermore, by making this empirical comparison and discovering a 7-month doubling time in the length of tasks models can complete, this work significantly reduces our uncertainty about both when to expect certain capabilities and (more impressively, according to me) how to conceptualize those AI capability levels. That is, on top of reducing our uncertainty, I think this work also provides a good general format / frame for reporting general AI capabilities forecasts, e.g., we have X years until models can do things that take human experts Y hours to do with reliability Z%.
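The "X years until Y-hour tasks at Z% reliability" frame reduces to a bit of arithmetic once you fix a doubling time. If the horizon doubles every 7 months, reaching a target horizon Y from a current horizon h0 takes 7 · log2(Y / h0) months. The sketch below makes two loud assumptions. First, the trend stays purely exponential, with a constant slope in log-space, which is exactly the assumption the comment discussions questioned. Second, the current horizon is a hypothetical placeholder of 1 hour at the fixed reliability level. That starting value is not a METR figure.

```python
import math

# The 7-month doubling time is the figure discussed in this comment.
# The starting horizon is a HYPOTHETICAL placeholder, not a METR number.
DOUBLING_TIME_MONTHS = 7.0
HYPOTHETICAL_CURRENT_HORIZON_HOURS = 1.0


def months_until_horizon(target_hours: float,
                         current_hours: float = HYPOTHETICAL_CURRENT_HORIZON_HOURS,
                         doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Months until models handle tasks taking human experts `target_hours`,
    at the same fixed reliability the current horizon is measured at.

    Assumes a pure exponential trend (constant slope in log-space):
    horizon(t) = current_hours * 2 ** (t / doubling_months).
    """
    if target_hours <= current_hours:
        return 0.0
    return doubling_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    for target in (8.0, 40.0, 160.0):  # roughly a workday, a week, a month
        years = months_until_horizon(target) / 12
        print(f"{target:>5.0f}-hour tasks: ~{years:.1f} years out")
```

Under these assumptions an 8-hour horizon is three doublings away, or 21 months. Any steepening or flattening of the log-space slope feeds straight into that answer. This is why the debate over whether the trend goes superexponential matters so much for forecasts stated in this format.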

I also appreciated the discussions this post inspired about whether we should expect the slope in log-space to change, and if so in which direction, as well as the related discussion about whether we should expect this trend to go superexponential. Interesting arguments and models were put forth in both discussions.

I hope in the future METR explores other methods for concretizing/operationalizing and forecasting AI capability levels. For example, comparing human expert reliability in general within specific task domains to model task reliability within those same domains, or comparing the time humans take to become reliable experts in certain domains to model task reliability within those same domains.

What Is The Alignment Problem?

Ronny Fernandez · 10mo

Curated. Tackles thorny conceptual issues at the foundation of AI alignment while also revealing the weak spots of the abstractions used to do so.

I like the general strategy of trying to make progress on understanding the problem while relying only on the concept of "basic agency", without having to work on the much harder problem of coming up with a useful formalism of a more full-throated conception of agency, whether or not that turns out to be enough in the end.

The core point of the post is plausible and worthy of discussion: certain kinds of goals only make sense at all given that certain kinds of patterns are present in the environment, and most of the work of making sense of the alignment problem is identifying what those patterns are for the goal of "make aligned AGIs". I also appreciate that this post elucidates the (according to me) canon-around-these-parts general patterns that render the specific goal of aligning AGIs sensible (e.g., compression-based analyses of optimization) and explicitly presents them as such.

The introductory examples of patterns that must be present in the general environment for certain simpler goals to make sense are clear and evocative, especially in showing how a goal stops making sense once its pattern is absent. I would not be surprised if they helped someone notice that the canon-around-these-parts hypothesized patterns which render "align AGIs" a sensible goal are flawed in some important ways.

Judgements: Merging Prediction & Evidence

Ronny Fernandez · 10mo

Curated. The problem of certain evidence is an old, fundamental problem in Bayesian epistemology, and this post makes a simple and powerful conceptual point tied to a standard way of trying to resolve that problem. Explaining how to think about certain evidence vs. something like Jeffrey conditionalization under the prediction-market analogy of a Bayesian agent is itself valuable. The post further points out both that:

1) Using the analogy, you can think of evidence and hypotheses as objects with the same type signature.

and

2) The analogy reveals the difference between them to be quantitative rather than qualitative.

These points move me much further in the direction of thinking that radical probabilism will be a fruitful research program. Unfruitful research programs rarely reveal deep underlying similarities between seemingly very different types of fundamental objects.
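Jeffrey conditionalization is the standard rule for updating on uncertain evidence. Given a partition {E_i} whose cells shift to new probabilities, it sets P_new(A) = Σ_i P_old(A | E_i) · P_new(E_i). Ordinary certain-evidence conditioning is the special case where one cell gets probability 1. What follows is a minimal sketch of that textbook rule, not of the post's prediction-market framing. The weather labels and numbers are hypothetical.

```python
from typing import Dict, Tuple

# A "world" pairs a hypothesis with the evidence cell it falls in.
# The labels used below (rain/dry, dark_clouds/clear) are hypothetical.
World = Tuple[str, str]
Dist = Dict[World, float]


def jeffrey_update(prior: Dist, new_cell_probs: Dict[str, float]) -> Dist:
    """Jeffrey conditionalization on the partition of evidence cells.

    P_new(w) = P_old(w | E) * P_new(E) for the cell E containing w, so
    probabilities conditional on each cell are unchanged (rigidity).
    """
    cell_mass: Dict[str, float] = {}
    for (_, cell), p in prior.items():
        cell_mass[cell] = cell_mass.get(cell, 0.0) + p

    posterior: Dist = {}
    for (hyp, cell), p in prior.items():
        mass = cell_mass[cell]
        # A cell with zero prior mass cannot be given new weight by this rule.
        posterior[(hyp, cell)] = 0.0 if mass == 0 else p / mass * new_cell_probs.get(cell, 0.0)
    return posterior


def marginal(dist: Dist, hyp: str) -> float:
    """Total probability of a hypothesis, summed over evidence cells."""
    return sum(p for (h, _), p in dist.items() if h == hyp)


if __name__ == "__main__":
    prior: Dist = {
        ("rain", "dark_clouds"): 0.30, ("rain", "clear"): 0.05,
        ("dry", "dark_clouds"): 0.20, ("dry", "clear"): 0.45,
    }
    # Uncertain observation: a dim glimpse of the sky.
    soft = jeffrey_update(prior, {"dark_clouds": 0.8, "clear": 0.2})
    # Certain observation: the special case of ordinary conditioning.
    hard = jeffrey_update(prior, {"dark_clouds": 1.0, "clear": 0.0})
    print(f"P(rain) prior={marginal(prior, 'rain'):.2f} "
          f"soft={marginal(soft, 'rain'):.2f} hard={marginal(hard, 'rain'):.2f}")
```

Here P(rain) goes from 0.35 to 0.50 under the soft update and to 0.60 under the hard one. Only the probabilities of the evidence cells differ between the two calls, which gives a small taste of the "quantitative rather than qualitative" point.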

Lighthaven Sequences Reading Group #7 (Tuesday 10/22)

Ronny Fernandez · 1y

There is! It is now posted! Sorry about the delay.
