Comments

So I ran a quick test (the llama.cpp perplexity command on wiki.test.raw):

  • base_model (Meta-Llama-3-8B-Instruct-Q6_K.gguf): PPL = 9.7548 +/- 0.07674
  • steered_model (llama-3-8b-instruct_activation_steering_q8.gguf): PPL = 9.2166 +/- 0.07023

So perplexity actually went down, though that may be because the base model I used was more heavily quantized (Q6_K versus Q8). Still, it is moderate evidence that the output quality cost of activation steering is smaller than that of Q8->Q6 quantisation.
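As a rough sanity check on whether the gap is larger than the reported error bars, here is a minimal sketch. It assumes the two "+/-" values can be treated as independent standard errors, which is not strictly true since both runs score the same text, so read the result as indicative only. The function name is hypothetical.

```python
import math

def ppl_difference_z(ppl_a, err_a, ppl_b, err_b):
    """Return (difference, combined error, z-score) for two perplexity runs.

    Assumption: err_a and err_b are independent standard errors.
    """
    diff = ppl_a - ppl_b
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return diff, combined, diff / combined

# Numbers copied from the two runs above (base Q6_K vs steered Q8).
diff, combined, z = ppl_difference_z(9.7548, 0.07674, 9.2166, 0.07023)
print(f"difference = {diff:.4f}, combined error = {combined:.4f}, z = {z:.2f}")
```

Under that assumption the gap is several combined error bars wide, so it is unlikely to be sampling noise. It still cannot separate the effect of steering from the effect of the different quantization levels.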

I must say, I am a little surprised by how low the cost of activation editing seems to be. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality, mainly because they use worse fine-tuning data than the data Llama-3 was originally fine-tuned on.

maintaining model coherence

To determine this, I believe we would need to demonstrate that the score on some evaluations remains the same. A few examples don't seem sufficient to establish this, as it is too easy to fool ourselves by not being quantitative.

I don't think DanH's paper did this either. So I'm curious, in general, whether these models maintain performance, especially on measures of coherence.

In the open-source community, people typically show that a modified model retains, for example, its MMLU and HumanEval scores.

This is pretty good. It has a lot in it, being a grab bag of things. I particularly enjoyed the scalable oversight sections which succinctly explained debate, recursive reward modelling etc. There were also some gems I hadn't encountered before, like the concept of training out agentic behavior by punishing side-effects.

If anyone wants the HTML version of the paper, it is here.

Maybe our culture fits our status-seeking surprisingly well because our culture was designed around it.

We design institutions to channel and utilize our status-seeking instincts. We put people in status conscious groups like schools, platoons, or companies. There we have ceremonies and titles that draw our attention to status.

And this works! Ask yourself, is it more effective to educate a child individually or in a group of peers? The latter. Is it easier to lead a solitary soldier or a whole squad? The latter. Do people seek a promotion or a pay rise? Both, probably. The fact is that people are easier to guide when in large groups, and easier to motivate with status symbols.

From this perspective, our culture and inclination for seeking status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture progresses more rapidly than genes, suggesting that culture conforms to our genes, rather than the reverse.

Another perspective: sometimes our status-seeking is nonfunctional, and therefore misaligned. For example, we waste a lot of effort on status. People compete for high-status professions like musician, streamer, or celebrity, and most fail, which makes it look like an unwise investment of time. This seems misaligned, as it's not adaptive.

would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.

It seems that it would remain deferential 1) when extrapolating to new situations, 2) if you add a term that decays the relevance of old information (pretty standard in RL), or 3) if you add a minimum bound on its uncertainty.
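To make points 2 and 3 concrete, here is a toy sketch, not taken from any CIRL paper. It assumes the agent's belief about a single human preference is a Beta distribution, that "decay" means discounting old pseudo-counts before each update, and that the uncertainty floor is a simple clamp on the variance. All names are hypothetical.

```python
def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) belief; smaller means more certain."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

def update(alpha, beta, human_approved, decay=1.0):
    """Discount old pseudo-counts by `decay`, then add the new observation."""
    alpha = decay * alpha + (1.0 if human_approved else 0.0)
    beta = decay * beta + (0.0 if human_approved else 1.0)
    return alpha, beta

def final_variance(n_steps, decay, min_variance=0.0):
    """Belief variance after n_steps of toy feedback, clamped from below."""
    alpha, beta = 1.0, 1.0  # uniform prior
    for step in range(n_steps):
        human_approved = step % 2 == 0  # toy feedback: alternating answers
        alpha, beta = update(alpha, beta, human_approved, decay)
    return max(beta_variance(alpha, beta), min_variance)

no_decay = final_variance(1000, decay=1.0)
with_decay = final_variance(1000, decay=0.9)
with_floor = final_variance(1000, decay=1.0, min_variance=0.01)
print(no_decay, with_decay, with_floor)
```

Without decay the variance keeps shrinking as evidence accumulates, so an agent that defers only when uncertain would eventually stop asking. With decay the effective sample size is capped at roughly 1 / (1 - decay), and with a floor the variance is bounded directly, so in both cases the uncertainty never vanishes.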

In other words, it doesn't seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions. So why the hostility?

Academia and LessWrong are two different groups, which have different cultures and jargon. I think they may be overly skeptical towards each other's work at times.

It's worth noting though that many of the nice deferential properties may appear in other value modelling techniques (like recursive reward modelling at OpenAI).

A nice introductory essay; it seems valuable for entrants.

There are quite a few approaches to alignment beyond CIRL and Value Learning. https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety

There's some recent academic research on CIRL which is overlooked on LessWrong. Here we seem to only discuss Stuart Russell's work.

Recent work:

See also the overviews in lectures 3 and 4 of Roger Grosse's CSC2547 Alignment Course.

The most interesting feature is deference: you can have a pathologically uncertain agent that constantly seeks human input. As part of its uncertainty, it's also careful about how it seeks that input. For example, if it's unsure whether humans like to be stabbed (we don't), it wouldn't stab you to see your reaction, as that would be risky! Instead, it would ask, or seek out historical evidence.

This is an important safety feature which slows it down, grounds it, and helps avoid risky value extrapolation (and therefore avoids poor convergence).
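The "ask rather than stab" behaviour can be illustrated with a tiny expected-value comparison. This is only a sketch under made-up assumptions: the agent gets a payoff of 1 if the human likes the action, pays `harm` if they don't, and asking costs a small `ask_cost` but reveals the answer first. The function and parameter names are hypothetical.

```python
def choose(p_liked, harm, ask_cost):
    """Pick the option with the highest expected value under the toy model.

    p_liked:  agent's belief that the human likes the action
    harm:     cost incurred if the agent acts and the human dislikes it
    ask_cost: small cost of querying the human before acting
    """
    values = {
        # Act blindly: gain 1 if liked, lose `harm` otherwise.
        "act": p_liked * 1.0 - (1.0 - p_liked) * harm,
        # Ask first: act only if the human says yes, so no harm is risked.
        "ask": p_liked * 1.0 - ask_cost,
        # Do nothing at all.
        "refrain": 0.0,
    }
    return max(values, key=values.get)

# Unsure and the downside is severe: asking beats trying it out.
print(choose(p_liked=0.5, harm=100.0, ask_cost=0.01))
# Nearly certain and the downside is mild: just act.
print(choose(p_liked=0.999, harm=1.0, ask_cost=0.1))
```

The more uncertain the agent is and the worse the potential harm, the more attractive asking becomes relative to acting, which is the deferential behaviour described above.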

It's worth noting that CIRL sometimes goes by other names:

Inverse reinforcement learning, inverse planning, and inverse optimal control, are all different names for the same problem: Recover some specification of desired behavior from observed behavior

It's also highly related to both assistance games and Recursive Reward Modelling (part of OpenAI's superalignment).

On the other hand, there are some old rebuttals of parts of it

as long as training and eval error are similar

It's just that eval and training are so damn similar, and all other problems are so different. So while it is technically not overfitting (to this problem), it is certainly overfitting to these specific problems, and it certainly isn't measuring generalization in any sense of the word. Certainly not in the sense of helping us debug alignment for all problems.

This is an error that, imo, all papers currently make though! So it's not a criticism so much as an interesting debate, and a nudge to use a harder test or OOD set in your benchmarks next time.

but you can't say they're more scalable than SAE, because SAEs don't have to have 8 times the number of features

Yeah, good point. I just can't help but think there must be a way of using unsupervised learning to force a compressed, human-readable encoding. Going uncompressed just seems wasteful, and like it won't scale. But I can't think of a machine-learnable, unsupervised, human-readable encoding. Any ideas?

Interesting, got any more? Especially for toddlers and so on, or would you go through everything those women have uploaded?