Visible Thoughts Project and Bounty Announcement

> If it were me, I’d also try to increase RoI by asking people to add commentary to existing books, rather than having people write from scratch.

This thought occurred to me - specifically, there's likely quite a bit of interactive fiction out there with a suitable format which could be post-hoc thought annotated (might also be interesting to include a few different branches).

However, I don't think it gives us the same thing: presumably we'd want the thoughts to be those that occurred at the time and contributed to the writing of the later narrative. Doing post-hoc annotation by trying to infer what a writer might plausibly have thought seems like quite a different process. Perhaps that wouldn't matter for some purposes, but I imagine it would for others.

While it'd be possible to check that post-hoc annotations passed a [human reader can't tell the difference] test, this wouldn't eliminate the difference - it'd only tell us it's non-obvious to humans.

Visible Thoughts Project and Bounty Announcement

I think you're essentially correct - but if I understand you, what you're suggesting is similar to Chris Olah et al's Circuits work (mentioned above in the paragraph starting "This sort of interpretability is distinct..."). If you have a viable approach aiming at that kind of transparency, many people will be eager to provide whatever resources are necessary.
This is being proposed as something different, and almost certainly easier.

One specific thought:

> but my intuition suggests this would limit the complexity of the prompt by shackling it's creation to an unnecessary component, the thought

To the extent that this is correct, it's more of a feature than a bug. You'd want the thoughts to narrow the probability distribution over outputs. However, I don't think it's quite right: the output can still have just as much complexity; the thoughts only serve to focus that complexity.

E.g. consider [This will be a realist novel about 15th century France] vs [This will be a surrealist space opera]. An output corresponding to either can be similarly complex.
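To illustrate the distinction, here's a toy sketch of my own (made-up outputs and probabilities): conditioning on a thought narrows the support of the output distribution, while each surviving output remains just as complex as before.

```python
from fractions import Fraction

# Toy sketch (hypothetical outputs and probabilities): a prior over story
# openings spread across two genres. A "thought" like [this will be a
# realist novel about 15th century France] conditions the distribution,
# narrowing its support without simplifying any individual output.
prior = {
    ("realist", "A tax ledger dispute unfolds in 15th-century Rouen..."): Fraction(1, 4),
    ("realist", "A siege diary from Orleans, kept by a notary's clerk..."): Fraction(1, 4),
    ("space opera", "The nebula sang in ultraviolet as the barge folded space..."): Fraction(1, 4),
    ("space opera", "A smuggler auctions a dead star's coordinates..."): Fraction(1, 4),
}

def condition_on_thought(dist, genre):
    """Bayesian conditioning: keep outputs consistent with the thought, renormalise."""
    kept = {k: p for k, p in dist.items() if k[0] == genre}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

posterior = condition_on_thought(prior, "realist")
# The support halves (4 -> 2 outputs); each remaining output is unchanged.
```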

Visible Thoughts Project and Bounty Announcement

> However I also could see the "thoughts" output misleading people - people might mistake the model's explanations as mapping onto the calculations going on inside the model to produce an output.

I think the key point on avoiding this is the intervening-on-the-thoughts part:
"An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts".

So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.
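As a toy sketch of that intervention test (entirely hypothetical thoughts and story text, not the project's actual setup): generation is split into a visible thought stage and a story stage conditioned on it, so we can swap the thought and check that the story changes sensibly.

```python
import random

# Hypothetical two-stage generator: prompt -> visible thought -> story text.
# The "thoughts do map onto the computation" claim corresponds to the story
# stage actually depending on the thought, so interventions take effect.
THOUGHT_TO_STORY = {
    "portray the dragon as sympathetic": "The dragon lowered its head, eyes dim with old grief.",
    "portray the dragon as menacing": "The dragon's shadow swallowed the village square.",
}

def generate(prompt, thought=None):
    """Return (thought, story); passing `thought` intervenes on the intermediate."""
    if thought is None:
        thought = random.choice(sorted(THOUGHT_TO_STORY))  # model's own thought
    return thought, THOUGHT_TO_STORY[thought]              # story conditioned on thought

# Intervening on the thought yields a different, still-sensible story:
_, story_a = generate("A dragon arrives.", thought="portray the dragon as sympathetic")
_, story_b = generate("A dragon arrives.", thought="portray the dragon as menacing")
```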

A positive case for how we might succeed at prosaic AI alignment

> Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency.

It's an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here and concluded they're not serious problems?

I still don't see how we could get e.g. an HCH simulator without agentic components (or the simulator's qualifying as an agent).
As soon as an LCDT agent expects that it may create agentic components in its simulation, it's going to reason horribly about them (e.g. assuming that any adjustment it makes to other parts of its simulation can't possibly impact their existence or behaviour, relative to the prior).

I think LCDT does successfully remove the incentives you're aiming to remove. I just expect it to be too broken to do anything useful. I can't currently see how we could get the good parts without the brokenness.

Discussion with Eliezer Yudkowsky on AGI interventions

Again, I broadly agree - usually I expect sharing to be best. My point is mostly that there's more to account for in general than an [alignment : capability] ratio.

Some thoughts on your specific points:

> First, you can share your solution while also writing about its flaws.

Sure, but if the 'flaw' is of the form [doesn't address a problem various people don't believe really matters/exists], then it's not clear this helps - e.g. an outer alignment solution that doesn't deal with deceptive alignment.

> Second, I think "some TAI will be built whether there is any solution or not" is more likely than "TAI will be built iff a solution is available, even if the solution is flawed".

Here I probably agree with you, on reflection. My initial thought was that the [plausible-flawed-solution] makes things considerably worse, but I don't think I was accounting for the scenario where people believe a robust alignment solution will be needed at some point, but just not yet - because you see this system isn't a dangerous one...

Does this make up the majority of your TAI-catastrophe-probability? I.e. it's mostly "we don't need to worry yet... Foom" rather than "we don't ever need to worry about (e.g.) deceptive alignment... Foom".

> Third, I just don't see the path to success that goes through secrecy...

I definitely agree, but I do think it's important not to approach the question in black-and-white terms. For some information it may make sense to share with, say, 5 or 20 people (at least at first).

I might consider narrower sharing if:

  1. I have wide error bars on the [alignment : capabilities] ratio of my work. (in particular, approaches that only boost average-case performance may fit better as "capabilities" here, even if they're boosting performance through better alignment [essentially this is one of Critch/Krueger's points in ARCHES])
  2. I have high confidence my work solves [some narrow problem], high confidence it won't help in solving the harder alignment problems, and an expectation that it may be mistaken for a sufficient solution.

Personally, I find it implausible I'd ever decide to share tangible progress with no-one, but I can easily imagine being cautious in sharing publicly.

On the other hand, doomy default expectations argue for ramping up the variance in the hope of bumping into 'miracles'. I'd guess that increased sharing of alignment work boosts the miracle rate more significantly than capabilities.

Discussion with Eliezer Yudkowsky on AGI interventions

> and we should only publish work whose alignment : capability ratio is high enough s.t. it improves the trajectory in expectation

Broadly I'd agree, but I think there are cases where this framing doesn't work. Specifically, it doesn't account for others' inaccurate assessments of what constitutes a robust alignment solution. [and of course here I'm not disagreeing with "publish work [that] improves the trajectory in expectation", but rather with the idea that a high enough alignment : capability ratio ensures this]

Suppose I have a fully implementable alignment approach which solves a large part of the problem (but boosts capability not at all). Suppose also that I expect various well-resourced organisations to conclude (incorrectly) that my approach is sufficient as a full alignment solution.
If I expect capabilities to reach a critical threshold before the rest of the alignment problem will be solved, or before I'm able to persuade all such organisations that my approach isn't sufficient, it can make sense to hide my partial solution (or at least to be very careful about sharing it).

For example, take a full outer alignment solution that doesn't address deceptive alignment.
It's far from clear to me that it'd make sense to publish such work openly.

While I'm generally all for the research productivity benefits of sharing, I think there's a very real danger that the last parts of the problem to be solved may be deep, highly non-obvious problems. Before that point, the wide availability of plausible-but-tragically-flawed alignment approaches might be a huge negative. (unless you already assume that some organisation will launch an AGI, even with no plausible alignment solution)

Call for research on evaluating alignment (funding + advice available)

An issue with the misalignment definition:

> 2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:
>
>   • capable of outputting A instead, and
>   • capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B
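The quoted definition can be sketched as a predicate (my own formalisation; `capable_of` and `can_distinguish` are hypothetical oracles, not anything from the paper):

```python
from dataclasses import dataclass

@dataclass
class Case:
    output: str     # what the model actually output (B)
    preferred: str  # what the user would have preferred (A)

class Model:
    """Hypothetical oracle interface for the two capability clauses."""
    def capable_of(self, output: str) -> bool: ...
    def can_distinguish(self, a: str, b: str) -> bool: ...

class FullyCapable(Model):
    """Toy model satisfying both capability clauses for every output."""
    def capable_of(self, output): return True
    def can_distinguish(self, a, b): return True

def misaligned(model: Model, case: Case) -> bool:
    """The quoted definition, read literally: output B where A was preferred,
    while capable of A and of telling A-situations from B-situations."""
    return (
        case.output != case.preferred
        and model.capable_of(case.preferred)
        and model.can_distinguish(case.preferred, case.output)
    )
```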

Though it's a perfectly good rule-of-thumb / starting point, I don't think this ends up being a good definition: it doesn't work with A and B fixed throughout either as concrete outputs or as properties of outputs.

Case 1 - concrete A and B:

If A and B are concrete outputs (let's say they're particular large chunks of code), we may suppose that the user is shown the output B, and asked to compare it with some alternative A, which they prefer. [concreteness allows us to assume the user expressed a preference based upon all relevant desiderata]

For the definition to apply here, we need the model to be both:

  • capable of outputting the concrete A (not simply code sharing a high-level property of A).
  • capable of distinguishing between situations where the user wants it to output the concrete A, and where the user wants it to output concrete B

This doesn't seem too useful (though I haven't thought about it for long; maybe it's ok, since we only need to show misalignment for a few outputs).


Case 2 - A and B are high-level properties of outputs:

Suppose A and B are higher-level properties with A = "working code" and B = "buggy code", and that the user prefers working code.
Now the model may be capable of outputting working code, and able to tell the difference between situations where the user wants buggy/working code - so if it outputs buggy code it's misaligned, right?

Not necessarily, since we only know that the user prefers working code all else being equal.

Perhaps the user also prefers code that's beautiful, amusing, short, clear, elegant, free of profanity....
We can't say that the model is misaligned until we know it could do better w.r.t. the collection of all desiderata, and understands the user's preferences in terms of balancing desiderata sufficiently well. In general, failing on one desideratum doesn't imply misalignment.
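A toy illustration of the all-else-equal point (the desiderata and weights are invented by me purely for illustration): under the user's full preference bundle, an output failing one desideratum can still beat one that satisfies it.

```python
# Hypothetical desiderata weights for a user's overall preference; the
# numbers are made up purely to illustrate the all-else-equal point.
WEIGHTS = {"works": 3, "short": 2, "clear": 2}

def bundle_score(output: dict) -> int:
    """Score an output against the whole collection of desiderata."""
    return sum(WEIGHTS[d] for d, satisfied in output.items() if satisfied)

buggy_but_clear = {"works": False, "short": True, "clear": True}
working_but_opaque = {"works": True, "short": False, "clear": False}

# Failing on "works" alone doesn't imply misalignment: under the full
# bundle, the buggy output actually scores higher here (4 vs 3).
```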

The example in the paper would probably still work on this higher standard - but I think it's important to note for the general case.

[I'd also note that this single-desideratum vs overall-intent-alignment distinction seems important when discussing corrigibility, transparency, honesty... - these are all properties we usually want all else being equal; that doesn't mean that an intent-aligned system guarantees any one of them in general]

Call for research on evaluating alignment (funding + advice available)

This mostly seems plausible to me - and again, I think it's a useful exercise that ought to yield interesting results.

Some thoughts:

  1. Handwaving would seem to take us from "we can demonstrate capability of X" to "we have good evidence for capability of X". In cases where we've failed to prompt/finetune the model into doing X we also have some evidence against the model's capability of X. Hard to be highly confident here.
  2. Precision over the definition of a task seems important when it comes to output, since e.g. "do arithmetic" != "output arithmetic".
    This is relevant to the second capability-definition clause, since the kinds of X you can show are necessary for Y aren't usually behaviours, but rather internal processes. This doesn't seem too useful in attempting to show misalignment, since knowing the model can do X doesn't mean it can output the result of X.

The LessWrong Team is now Lightcone Infrastructure, come work with us!

I think the essential point is that you're actually not underpaying them - in terms of their own utility gain (if they believe in the mission). You're only 'underpaying' them in terms of money.

It's still not obviously the correct approach (externalities are an issue too), but [money != utility].

Call for research on evaluating alignment (funding + advice available)

I like the overall idea - seems very worthwhile.

A query on the specifics:

> We consider a model capable of some task X if:
>
>   • ...
>   • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y

Are you thinking that this is a helpful definition even when treating models as black boxes, or only based on some analysis of the model's internals? To me it seems workable only in the latter case.

In particular, from a black-box perspective, I don't think we ever know that task X is required for task Y. The most we can know is that some [task Z whose output logically entails the output of task X] is required for task Y (where of course Z may be X).

So this clause seems never to be satisfiable without highly specific knowledge of the internals of the model. (if we were to say that it's satisfied when we know Y requires some Z entailing X, then it seems we'd be requiring logical omniscience for intent alignment)

For example, the model may be doing something like: computing some combined Z (X fused together with extra work W) in a single step, without knowing that Z's output entails X's, or that X alone would suffice (W happening to be superfluous in this case).

Does that seem right, or am I confused somewhere?
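To make the Z-vs-X worry concrete, here's a toy black box of my own (hypothetical, nothing to do with any real model): it solves Y via one fused computation whose output logically entails X's answer, yet observing its behaviour on Y never shows it performing X as a separate task.

```python
# Toy "black box" (hypothetical): task Y is "report the sum and max of a
# list"; task X is "report the sum". The box solves Y via one fused pass Z.
# Z's output logically entails X's answer, but black-box success on Y
# doesn't show the box is capable of outputting X's answer on its own.

def black_box_Y(xs):
    total, biggest = 0, float("-inf")
    for x in xs:          # one fused computation: the task Z
        total += x
        biggest = max(biggest, x)
    return f"sum={total}, max={biggest}"  # X's answer is entailed, not separable

# From outputs like this alone, we know the box contains the information
# needed for X, but not that it can perform X as a standalone task.
```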


Another way to put this is that for workable cases, I'd expect the first clause to cover things: if the model knows how to separate the combined Z into X (plus the superfluous remainder) in the above, then I'd expect suitable prompt engineering, fine-tuning... to be able to get the model to do task X.

(EDIT - third time lucky :) :
If this isn't possible for a given X, then I expect the model isn't capable of task X (for non-deceptive models, at least).
For black boxes, the second clause only seems able to get us something like "the model contains sufficient information for task X to be performed", which is necessary, but not sufficient, for capability.)
