Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?

In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).
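In code terms, a minimal sketch of that contrast (hypothetical names, not from anything I've actually written):

```python
# Minimal sketch (hypothetical names) of the two reputation updates.

def update_model_free(reputation, succeeded, lr=0.1):
    # Model-free: reputation tracks the raw success/failure record alone.
    outcome = 1.0 if succeeded else 0.0
    return reputation + lr * (outcome - reputation)

def update_model_based(reputation, estimated_attempt_quality, lr=0.1):
    # Model-based: reputation tracks an evaluation of the attempt itself, so a
    # worthy failed attempt can gain credit and a foolish lucky win can lose it.
    return reputation + lr * (estimated_attempt_quality - reputation)
```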

There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.

The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.
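In rough notation (mine, and only a sketch; the official definitions differ in details): writing $\mathcal{PC}(D_n)$ for the worlds propositionally consistent with the $n$-th deductive state, and $\mathbb{W}(T_{1:n})$ for trader $T$'s wealth in world $\mathbb{W}$ after its first $n$ trades,

```latex
% Rough sketch only; not the official statement of either criterion.
\text{SLIC: } T \text{ exploits } \overline{\mathbb{P}} \iff
\left\{\, \mathbb{W}(T_{1:n}) : n \in \mathbb{N},\ \mathbb{W} \in \mathcal{PC}(D_n) \,\right\}
\text{ is unbounded above and bounded below.}

% WLIC (roughly): the wealth must grow unboundedly along worlds consistent
% with the entire deductive process, i.e., the trader must actually make
% the money, not merely hold assets that pay off in some sequence of
% diminishingly-probable worlds.
```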

Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.

  1. ^

    Roughly speaking. This is not quite an adequate description of the theorem.

Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.

One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into. 

Answer by abramdemski

I agree that:

  • People commonly talk about love as black-and-white when it isn't.
  • It is used as a semantic stopsign or applause light.
  • Love is a cluster concept made up of many aspects, but the way people talk about love obscures this. The concept of "true love" serves to keep people in the dark by suggesting that the apparent grab-bag nature of love is just the way people talk, and there's an underlying truth to be discovered.
  • Because experiences of love are fairly rare, people can only accumulate personal evidence about this slowly, so it remains plausible that what they've experienced is "just infatuation" or something else, and the "true love" remains to be discovered.
  • People will fight about what love is in sideways ways, EG "you don't know what love is" can be used to undercut the other person's authority on love (when a more accurate representation of the situation might be "I didn't enjoy participating in the social dynamic you're presently calling love").
  • Different people will sometimes express different detailed views about what love is, but there seems to be little drive to reach an agreement about this, at least between people who aren't romantic partners.

I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: it sounds like you don't believe in love!

(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)

So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but, should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?

I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.

I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it. 

I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.

I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi

I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).

As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.

Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?

Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.

Clarifying Concrete Algorithms

In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary parts. I'll call this 'basic training'.
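In sketch form (hypothetical interface names like `shoggoth.generate`; a single prompt for simplicity, per the caveat below):

```python
# Minimal sketch of 'basic training' (all interface names hypothetical).

def basic_training_step(prompt, shoggoth, face, score, n_samples=64, keep_frac=0.25):
    # Sample CoT+summary pairs and score each by its summary.
    samples = []
    for _ in range(n_samples):
        cot = shoggoth.generate(prompt)           # chain of thought
        summary = face.generate(prompt, cot)      # user-facing summary
        samples.append((cot, summary, score(summary)))

    # Throw away the worse-scoring fraction.
    samples.sort(key=lambda s: s[2], reverse=True)
    kept = samples[: int(n_samples * keep_frac)]

    # Fine-tune each model only on its own part of the kept samples.
    shoggoth.finetune([(prompt, cot) for cot, _, _ in kept])
    face.finetune([(prompt + cot, summary) for cot, summary, _ in kept])
```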

The most basic implementation of your thing is similar, except it discards the worst samples in the more complex way you've defined, instead of just by global score. I'll call this 'your proposal'. (Although it ignores the contrastive part of your proposal.)

(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary samples we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to take multiple samples for each question and discard samples based on performance relative to other samples for the same question. It's easier to focus on one question, though, to avoid describing too many algorithmic details.)

Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).
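Concretely, I'm imagining something like this (same hypothetical interface as above; assumes k ≥ 3 for the top-3 average):

```python
# Sketch of 'your proposal' as I understand it (hypothetical names; k >= 3).

def proposal_step(prompt, shoggoth, face, score, n=16, k=8):
    # Sample n CoTs; for each, sample k summaries and score the CoT by the
    # average of its top-3 summary scores.
    scored_cots = []
    for _ in range(n):
        cot = shoggoth.generate(prompt)
        summaries = [face.generate(prompt, cot) for _ in range(k)]
        top3_avg = sum(sorted((score(s) for s in summaries), reverse=True)[:3]) / 3
        scored_cots.append((cot, summaries, top3_avg))

    # CoT gradient: train the shoggoth toward the best CoT by top-3 average.
    best_cot, best_summaries, _ = max(scored_cots, key=lambda c: c[2])
    shoggoth.finetune([(prompt, best_cot)])

    # Summary gradient: train the face on the best summary *of the best CoT*,
    # not the best summary overall.
    best_summary = max(best_summaries, key=score)
    face.finetune([(prompt + best_cot, best_summary)])
```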

Analyzing CoT Gradient

For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summary scores for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring the CoTs, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.

Another variation would be to score each CoT by the average over all of its summary samples. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.
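In code, this variant is just a different CoT-scoring rule (same hypothetical setup):

```python
def cot_score_all_avg(summaries, score):
    # Variant: score a CoT by the average over *all* of its summary scores,
    # de-noising more for the same number of summary samples.
    return sum(score(s) for s in summaries) / len(summaries)
```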

A possible argument for your proposal over this alternative is that we're also training the face to improve its summaries, so the top-3 summaries are actually a better estimator of the kind of performance we care about (performance after training) than the average over all samples.

This argument, if valid, similarly recommends your proposal over basic training, because basic training is also looking at the average summary score per CoT. (It's also looking at the top summary score, but since it only samples one summary per CoT, these are the same.)

Analyzing Summary Gradient

For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.
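To make the contrast concrete, here are the two selection rules side by side (same hypothetical setup as the sketches above; `scored_cots` as produced there):

```python
def pick_summary_target(scored_cots, score):
    # Basic training: the highest-scoring summary overall, which can reward
    # high-variance CoTs that produce one outlier great summary.
    best_overall = max(
        (s for _, summaries, _ in scored_cots for s in summaries), key=score
    )
    # Your proposal: the highest-scoring summary of the best CoT, which trains
    # the face for the CoT distribution we're steering towards.
    _, best_cot_summaries, _ = max(scored_cots, key=lambda c: c[2])
    best_of_best = max(best_cot_summaries, key=score)
    return best_overall, best_of_best
```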

Conclusion

I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.

My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.

I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all? 

IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?

Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?

Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.
