thinking of goal-models as generative models of observations

actions?

Replying to The Compulsion For (Pseudo-)Mechanisms

Interesting. I mostly agree with the gist.

The following are a few thoughts that occur to me. Presented as potentially useful pointers, rather than well-thought-through arguments/conclusions.

I don't think "pseudo-mechanisms" is a useful label. Feels a bit too binary (and/or post-hoc) in a highly grey situation.
I'm not sure what you mean by "mechanistic model" vs "stable phenomenological compressions".
- I'm not saying I have no idea what you're talking about - just that I'm not clear quite how you want to distinguish these things. (note that I haven't read many of your previous posts - yet! :))
- As soon as I'm calling something a "stable" pattern in the data, there's at least an implicit [...and this pattern


Replying to Help the AI 2027 team make an online AGI wargame

Joe Collman 7mo

If you're aiming to get millions of players, I think [no music at all] would be counterproductive. There's a reason almost every non-trivial game in existence has music. Of course it's also nice if it's simple to turn off / customize / replace - but it's usually a mistake to expect that a high proportion of players are going to significantly customize things.

Music is a way to get some immediate emotional engagement without making meaningful design concessions (most other mechanisms imply some more significant design constraint). If you want millions of players, you want immediate emotional engagement.

Replying to A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Joe Collman 10mo

Unless I'm missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate already aims at. In GDM's recent paper on their approach to technical alignment, there's some discussion of amplified oversight (starts at page 71) more generally, and debate (starts at page 73).

If you see the approach you're suggesting as importantly different from debate approaches, it'd be useful to know where the key differences are.

(without having read too carefully, my initial impression is that this is the kind of thing I expect to work for a while, then fail [as with debate] - and my core concern is then: how do we accurately predict when it'll fail?)

Existing Safety Frameworks Imply Unreasonable Confidence

Joe Rogero, yams, Joe Collman 10mo

This is part of the MIRI Single Author Series. Pieces in this series represent the beliefs and opinions of their named authors, and do not claim to speak for all of MIRI.

Most human endeavors have bounded results. A construction project may result in a functional bridge or a deadly collapse, but even catastrophic failure will not kill a billion people. Both success and failure are bounded, and the humans undertaking such a project can make reasonably correct estimates of those bounds.

The development of frontier artificial intelligence often thwarts any concrete expectation. Leading developers talk of scenarios as extreme as human extinction from AI, but there is heavy disagreement and uncertainty about when...

Joe Collman 1y

Some thoughts:

The correct answer is clearly (c) - it depends on a bunch of factors.
My current guess is that it would make things worse (given likely values for the bunch of other factors) - basically for Richard's reasons.
- Given [new potential-to-shift-motivation information/understanding], I expect there's a much higher chance that this substantially changes the direction of a not-yet-formed project, than a project already in motion.
- Specifically:
  - Who gets picked to run such a project? If it's primarily a [let's beat China!] project, are the key people cautious and highly adaptable when it comes to top-level goals? Do they appoint deputies who're cautious and highly adaptable?
    - Here I note that the kind of 'caution' we'd need is

Joe Collman 1y

Making a conservative case for alignment

First some points of agreement:

I like that you're focusing on neglected approaches. Not much on the technical side seems promising to me, so I like to see exploration.
- Skimming through your suggestions, I think I'm most keen on human augmentation related approaches - hopefully the kind that focuses on higher quality decision-making and direction finding, rather than simply faster throughput.
I think outreach to Republicans / conservatives, and working across political lines is important, and I'm glad that people are actively thinking about this.
I do buy the [Trump's high variance is helpful here] argument. It's far from a principled analysis, but I can more easily imagine [Trump does correct thing] than [Harris does correct


Replying to Twitter thread on AI safety evals

Joe Collman 2y

"I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then."

I don't think this is quite right.

Two major objections to the bio-anchors 30-year-median conclusion might be:

1. The whole thing is laundering vibes into credible-sounding headline numbers.
2. Even if we stipulate that the methodology is sound, it measures an upper bound, not a median.

To me, (2) is the more obvious error. I basically buy (1) too, but I don't think we've gotten empirical evidence, since (2).

I guess there's a sense in which a mistake on (2) could be seen...

Replying to Circumventing interpretability: How to defeat mind-readers

Joe Collman 2y

To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].

That said, I think I'd disagree on one word of the following:

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they've been

Joe Collman 2y

Circumventing interpretability: How to defeat mind-readers

Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active.

It seems to me that there's no fixed notion of "active" that works for both paragraphs here.

If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions....

Replying to On “first critical tries” in AI alignment

Joe Collman 2y

I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].

I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].

Truthfulness, standards and credibility

Joe Collman

-1: Meta Prelude

While truthfulness is a topic I’ve been thinking about for some time, I’ve not discussed much of what follows with others. Therefore, at the very least I expect to be missing important considerations on some issues (where I’m not simply wrong).

I’m hoping this should make any fundamental errors in my thought process more transparent, and amenable to correction. The downside may be reduced clarity, more illusion-of-transparency…. Comments welcome on this approach.

I don’t think what follows is novel. I’m largely pointing at problems based on known issues.
Sadly, I don’t have a clear vision of an approach that would solve these problems.

0: Introduction

…our purpose is not to give the last word, but


Review of "Learning Normativity: A Research Agenda"

Gyrodiot, adamShimi, Joe Collman

Introduction

We (Adam Shimi, Joe Collman & myself) are trying to emulate peer review feedback for Alignment Forum posts. This is the second review in the series. The first’s introduction sums up our motivation and approach rather well, so we will not duplicate it here.

Instead, let’s dive into today’s reviewed work: Learning Normativity: A Research Agenda by Abram Demski. We’ll follow the same structure as before: summarize the work, locate its hypotheses, and examine its relevance to the field.

This post was written by Jérémy; as such, his perspective will likely bias its content, even if both Adam and Joe approve of it.

Summary

The post describes a conceptual target for AI alignment, normativity, that differs in...

Review of "Fun with +12 OOMs of Compute"

adamShimi, Joe Collman, Gyrodiot

Introduction

This review is part of a project with Joe Collman and Jérémy Perret to try to get as close as possible to peer review when giving feedback on the Alignment Forum. Our reasons behind this endeavor are detailed in our original post asking for suggestions of works to review; but the gist is that we hope to bring further clarity to the following questions:

How many low-hanging fruits in terms of feedback can be plucked by getting into a review mindset and seeing the review as part of one’s job?
Given the disparate state of research in AI Alignment, is it possible for any researcher to give useful feedback on any other research work


A Critique of Non-Obstruction

Joe Collman

Epistemic status: either I’m confused, or non-obstruction isn’t what I want.

This is a response to Alex Turner’s Non-Obstruction: A Simple Concept Motivating Corrigibility. Please read that first, and at least skim Reframing Impact where relevant.
It’s all good stuff.

I may very well be missing something: if not, it strikes me as odd that many smart people seem to have overlooked the below. From an outside-view, the smart money says I'm confused.
Feel free to mentally add “according to my current understanding”, “unless I’m missing something”, “it seems to me” as appropriate.

I’m writing this because:

Non-obstruction seems like an important idea, but I don’t think it works.
I’d like to find out whether/where I’m confused, why the


Optimal play in human-judged Debate usually won't answer your question

Joe Collman

Epistemic status: highly confident (99%+) this is an issue for optimal play with human consequentialist judges. Thoughts on practical implications are more speculative, and involve much hand-waving (70% sure I’m not overlooking a trivial fix, and that this can’t be safely ignored).

Note: I fully expect some readers to find the core of this post almost trivially obvious. If you’re such a reader, please read as “I think [obvious thing] is important”, rather than “I’ve discovered [obvious thing]!!”.

Introduction

In broad terms, this post concerns human-approval-directed systems generally: there’s a tension between [human approves of solving narrow task X] and [human approves of many other short-term things], such that we can’t say much about what...

Literature Review on Goal-Directedness

adamShimi, Michele Campolo, Joe Collman

Introduction: Questioning Goals

Goals play a central role in almost all thinking in AI existential risk research. Common scenarios assume misaligned goals, be it from a single AGI (paperclip maximizer) or multiple advanced AIs optimizing for things we don’t want (Paul Christiano’s What Failure Looks Like). Approaches around this issue ask for learning the right goals (value/preference learning), allowing the correction of a goal on the fly (corrigibility), or even removing incentives for forming goals (CAIS).

But what are goals, and what does it mean to pursue one?

As far as we know, Rohin Shah’s series of four posts was the first public and widely-read work questioning goals and their inevitability in AI Alignment. These...
