Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?

In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).
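In code terms, a minimal sketch of that contrast (hypothetical names, not from anything I've actually written):

```python
# Minimal sketch (hypothetical names) of the two reputation updates.

def update_model_free(reputation, succeeded, lr=0.1):
    # Model-free: reputation tracks the raw success/failure record alone.
    outcome = 1.0 if succeeded else 0.0
    return reputation + lr * (outcome - reputation)

def update_model_based(reputation, estimated_attempt_quality, lr=0.1):
    # Model-based: reputation tracks an evaluation of the attempt itself, so a
    # worthy failed attempt can gain credit and a foolish lucky win can lose it.
    return reputation + lr * (estimated_attempt_quality - reputation)
```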

There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.

The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.
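In rough notation (mine, and only a sketch; the official definitions differ in details): writing $\mathcal{PC}(D_n)$ for the worlds propositionally consistent with the $n$-th deductive state, and $\mathbb{W}(T_{1:n})$ for trader $T$'s wealth in world $\mathbb{W}$ after its first $n$ trades,

```latex
% Rough sketch only; not the official statement of either criterion.
\text{SLIC: } T \text{ exploits } \overline{\mathbb{P}} \iff
\left\{\, \mathbb{W}(T_{1:n}) : n \in \mathbb{N},\ \mathbb{W} \in \mathcal{PC}(D_n) \,\right\}
\text{ is unbounded above and bounded below.}

% WLIC (roughly): the wealth must grow unboundedly along worlds consistent
% with the entire deductive process, i.e., the trader must actually make
% the money, not merely hold assets that pay off in some sequence of
% diminishingly-probable worlds.
```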

Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.

  1. ^

    Roughly speaking. This is not quite an adequate description of the theorem.

Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.

One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into. 

Answer by abramdemski

I agree that:

  • People commonly talk about love as black-and-white when it isn't.
  • It is used as a semantic stopsign or applause light.
  • Love is a cluster concept made up of many aspects, but the way people talk about love obscures this. The concept of "true love" serves to keep people in the dark by suggesting that the apparent grab-bag nature of love is just the way people talk, and there's an underlying truth to be discovered.
  • Because experiences of love are fairly rare, people can only accumulate personal evidence about this slowly, so it remains plausible that what they've experienced is "just infatuation" or something else, and the "true love" remains to be discovered.
  • People will fight about what love is in sideways ways, EG "you don't know what love is" can be used to undercut the other person's authority on love (when a more accurate representation of the situation might be "I didn't enjoy participating in the social dynamic you're presently calling love").
  • Different people will sometimes express different detailed views about what love is, but there seems to be little drive to reach an agreement about this, at least between people who aren't romantic partners.

I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: it sounds like you don't believe in love!

(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)

So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but, should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?

I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.

I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it. 

I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.

I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi

I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).

As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.

Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?

Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.

Clarifying Concrete Algorithms

In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary parts. I'll call this 'basic training'.
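In sketch form (hypothetical interface names like `shoggoth.generate`; a single prompt for simplicity, per the caveat below):

```python
# Minimal sketch of 'basic training' (all interface names hypothetical).

def basic_training_step(prompt, shoggoth, face, score, n_samples=64, keep_frac=0.25):
    # Sample CoT+summary pairs and score each by its summary.
    samples = []
    for _ in range(n_samples):
        cot = shoggoth.generate(prompt)           # chain of thought
        summary = face.generate(prompt, cot)      # user-facing summary
        samples.append((cot, summary, score(summary)))

    # Throw away the worse-scoring fraction.
    samples.sort(key=lambda s: s[2], reverse=True)
    kept = samples[: int(n_samples * keep_frac)]

    # Fine-tune each model only on its own part of the kept samples.
    shoggoth.finetune([(prompt, cot) for cot, _, _ in kept])
    face.finetune([(prompt + cot, summary) for cot, summary, _ in kept])
```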

The most basic implementation of your thing is similar, except it discards the worst samples in the more complex way you've defined, instead of just by global score. I'll call this 'your proposal'. (Although it ignores the contrastive part of your proposal.)

(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary samples we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to take multiple samples for each question and discard samples based on performance relative to other samples for the same question. It's easier to focus on one question, though, to avoid describing too many algorithmic details.)

Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).
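Concretely, I'm imagining something like this (same hypothetical interface as above; assumes k ≥ 3 for the top-3 average):

```python
# Sketch of 'your proposal' as I understand it (hypothetical names; k >= 3).

def proposal_step(prompt, shoggoth, face, score, n=16, k=8):
    # Sample n CoTs; for each, sample k summaries and score the CoT by the
    # average of its top-3 summary scores.
    scored_cots = []
    for _ in range(n):
        cot = shoggoth.generate(prompt)
        summaries = [face.generate(prompt, cot) for _ in range(k)]
        top3_avg = sum(sorted((score(s) for s in summaries), reverse=True)[:3]) / 3
        scored_cots.append((cot, summaries, top3_avg))

    # CoT gradient: train the shoggoth toward the best CoT by top-3 average.
    best_cot, best_summaries, _ = max(scored_cots, key=lambda c: c[2])
    shoggoth.finetune([(prompt, best_cot)])

    # Summary gradient: train the face on the best summary *of the best CoT*,
    # not the best summary overall.
    best_summary = max(best_summaries, key=score)
    face.finetune([(prompt + best_cot, best_summary)])
```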

Analyzing CoT Gradient

For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summary scores for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring the CoTs, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.

Another variation would be to score each CoT by the average over all of its summary samples. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.
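In code, this variant is just a different CoT-scoring rule (same hypothetical setup):

```python
def cot_score_all_avg(summaries, score):
    # Variant: score a CoT by the average over *all* of its summary scores,
    # de-noising more for the same number of summary samples.
    return sum(score(s) for s in summaries) / len(summaries)
```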

A possible argument for your proposal over this alternative is that we're also training the face to improve its summaries, so the top-3 summaries are actually a better estimator of the kind of performance we care about (performance after training) than the average over all samples.

This argument, if valid, similarly recommends your proposal over basic training, because basic training is also looking at the average summary score per CoT. (It's also looking at the top summary score, but since it only samples one summary per CoT, these are the same.)

Analyzing Summary Gradient

For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.
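To make the contrast concrete, here are the two selection rules side by side (same hypothetical setup as the sketches above; `scored_cots` as produced there):

```python
def pick_summary_target(scored_cots, score):
    # Basic training: the highest-scoring summary overall, which can reward
    # high-variance CoTs that produce one outlier great summary.
    best_overall = max(
        (s for _, summaries, _ in scored_cots for s in summaries), key=score
    )
    # Your proposal: the highest-scoring summary of the best CoT, which trains
    # the face for the CoT distribution we're steering towards.
    _, best_cot_summaries, _ = max(scored_cots, key=lambda c: c[2])
    best_of_best = max(best_cot_summaries, key=score)
    return best_overall, best_of_best
```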

Conclusion

I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.

My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.

I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all? 

IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?

Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?

Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.
