Comments

They cover a lot of different territory. Maybe 20% of the essays between the two giant books are the same. The Goddess book is all fiction, so it's pretty distinct from the other two. I'd recommend the Goddess book if you're getting someone a gift.

Between the two 800-pagers, the Library of Scott Alexandria was written in 2014, so it's more classic Scott, and more focused on rationality specifically. I would personally prefer the newer SSC Abridged.

I used Lulu. Printing your book would be $40-50. Also, 900 pages is just a wee bit too big: the page limit is 800 pages.

Oh awkward, yeah I'm missing the link for SSC Abridged. Here's The Library of Scott Alexandria too in case you're having trouble finding it.

I thought about this a bit and have a few suggestions.

1. You could try using a ctrl or start-of-sentence token to distinguish text generated by the model from the prompt (see https://arxiv.org/abs/1909.05858 for the “ctrl” terminology). If you decorated every prompt to look like [prompt_token_1, prompt_token_2, … prompt_token_n, HARMLESS_MODEL_START], then the model would be better able to compartmentalize between prompts it’s being fed and what it’s asked to generate. This would also let you train on the prompt tokens, so you’d pay less of a safety penalty. Another important advantage is that this would let you use the model interactively, rather than as something that gives one-off completions. Relatedly, you could then play around with prompt tuning. (A rough sketch of the data formatting is at the end of this comment.)

2. Train a harmfulness classifier with causal (a.k.a. left-to-right, a.k.a. autoregressive) masking, so that you can then backpropagate through it to the generator, rather than having to do some kind of RL thing. (One way to make that gradient path concrete is sketched below.)

3. Indeed, the generator itself could double as this autoregressive classifier. That would make things more interpretable: you could see when the model thinks it’s saying something violent, and how it feels about the alternative options for each token. It could also help you find more probably-violent samples, since you can upsample tokens which have a high (violence * probability) score (see the scoring sketch below).

4. I wouldn’t expect this next approach to scale to more agenty models that aren’t language models, but one thing you could do to complement 1 and 3 is to also train the model (or another language model) to generate violent completions on demand with a VIOLENT_MODEL_START ctrl token. You could then do things like look at the relative probability of a sample being generated from the violent model vs. the harmless model. Maybe something interesting happens when both of those probabilities are high compared to a baseline NEUTRAL_MODEL_START’s probability of generating that sample. You could also add an auxiliary loss term to distance the harmless model from the harmful one, e.g. a term proportional to log P(token; M_harmful). (A sketch of the mode comparison is also at the end of this comment.)
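
For idea 1, here's a minimal sketch of what the decorated training examples could look like. The token names and ids (HARMLESS_MODEL_START, the -100 ignore index) are placeholder assumptions, not anyone's actual tokenizer:

```python
# Idea 1: decorate every prompt with a ctrl token marking where the
# model's own generation is supposed to begin.
IGNORE_INDEX = -100            # label value the loss is assumed to skip
HARMLESS_START_ID = 50257      # hypothetical id for HARMLESS_MODEL_START

def build_example(prompt_ids, completion_ids, train_on_prompt=True):
    """Return (input_ids, labels) for one decorated training example."""
    input_ids = prompt_ids + [HARMLESS_START_ID] + completion_ids
    # Standard next-token labels: position i is labeled with token i+1.
    labels = input_ids[1:] + [IGNORE_INDEX]
    if not train_on_prompt:
        # With the ctrl token marking the boundary you can usually keep the
        # prompt labels too, and pay less of a safety/data penalty.
        labels[:len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)
    return input_ids, labels
```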
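
For idea 2, one way to make "backpropagate through the classifier to the generator" concrete is to pass the generator's token distribution into the classifier as a soft (expected) embedding rather than a hard sample; that relaxation is my own addition, and the tiny models below are stand-ins rather than any particular architecture:

```python
# Idea 2: a causal harmfulness classifier the generator can get gradients
# through, via soft token embeddings. Toy PyTorch; the positional details
# (logits at t predicting token t+1) are glossed over for brevity.
import torch
import torch.nn as nn

VOCAB, D_MODEL = 1000, 64

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

generator = TinyCausalLM()
classifier = TinyCausalLM()
harm_head = nn.Linear(D_MODEL, 1)      # per-prefix harmfulness score

def differentiable_harm_loss(prompt_ids):
    T = prompt_ids.shape[1]
    causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    # Generator forward pass on the (hard) prompt tokens.
    gen_hidden = generator.body(generator.embed(prompt_ids), mask=causal_mask)
    gen_logits = generator.lm_head(gen_hidden)               # (B, T, VOCAB)
    # Relax the discrete sample: expected embedding under the generator's
    # distribution, so gradients flow back into gen_logits.
    probs = gen_logits.softmax(-1)
    soft_embeds = probs @ classifier.embed.weight            # (B, T, D_MODEL)
    clf_hidden = classifier.body(soft_embeds, mask=causal_mask)
    harm = torch.sigmoid(harm_head(clf_hidden)).mean()       # predicted harmfulness
    return harm                                              # minimize this

# loss = differentiable_harm_loss(torch.randint(0, VOCAB, (2, 16)))
# loss.backward()   # gradients reach the generator, no RL step needed
```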
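
For idea 3, the per-token (violence * probability) score might look something like this; the assumption that a single shared model emits both next-token logits and per-candidate violence logits is mine:

```python
# Idea 3: rank candidate next tokens by how probable AND how violent the
# (shared) model thinks they are, to mine probably-violent samples.
import torch

def violent_candidates(next_token_logits, violence_logits, k=5):
    """Both inputs are (vocab,)-shaped scores for the same prefix."""
    token_prob = next_token_logits.softmax(-1)     # P(token | prefix)
    violence_prob = violence_logits.sigmoid()      # P(violent | prefix + token)
    score = token_prob * violence_prob             # the (violence * probability) score
    top = torch.topk(score, k)
    return top.indices, top.values                 # tokens worth upsampling
```

Forcing the top-scoring tokens as continuations would be a cheap way to surface hard, probably-violent samples for the classifier to train on.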
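
And for idea 4, the mode comparison could be as simple as scoring the same completion under the three ctrl tokens. `score_under_mode` and the *_START ids are hypothetical helpers, and the auxiliary-loss detail is left as a comment since the right weighting isn't obvious:

```python
# Idea 4: compare log P(completion) under the violent / harmless / neutral
# ctrl-token modes of the same model. `model` is assumed to map a 1-D tensor
# of token ids to (T, vocab) logits.
import torch

HARMLESS_START, VIOLENT_START, NEUTRAL_START = 50257, 50258, 50259  # hypothetical ids

def score_under_mode(model, ctrl_id, prompt_ids, completion_ids):
    """Sum of log P(completion tokens | prompt + ctrl token)."""
    input_ids = torch.cat([prompt_ids, torch.tensor([ctrl_id]), completion_ids])
    logprobs = model(input_ids).log_softmax(-1)
    start = prompt_ids.shape[0] + 1                 # index of first completion token
    targets = input_ids[start:]
    # logprobs[t] scores the token at position t+1.
    return logprobs[start - 1:-1].gather(-1, targets.unsqueeze(-1)).sum()

def compare_modes(model, prompt_ids, completion_ids):
    lp_harmless = score_under_mode(model, HARMLESS_START, prompt_ids, completion_ids)
    lp_violent  = score_under_mode(model, VIOLENT_START,  prompt_ids, completion_ids)
    lp_neutral  = score_under_mode(model, NEUTRAL_START,  prompt_ids, completion_ids)
    # Flag samples where both specialized modes beat the neutral baseline;
    # maybe something interesting is going on there.
    interesting = (lp_harmless > lp_neutral) & (lp_violent > lp_neutral)
    # An auxiliary loss in the spirit of log P(token; M_harmful) could reuse
    # lp_violent as a penalty on completions sampled from the harmless mode.
    return interesting, (lp_harmless, lp_violent, lp_neutral)
```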