Osaid Nasir

Sr Applied Research Engineer at LinkedIn

LLM Reasoning, Alignment, RL

Comments

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Osaid Nasir · 1y

since accuracies aren't near-100% we know there are some cases the model hasn't memorized, so the mechanism you suggest doesn't apply to those inputs

That makes sense.

I suspect the prompts are a bigger deal

Do you suppose a suitable proxy for prompt quality could be to replicate these experiments with LLM debaters/judges of different sizes? Say P is the optimal prompt and Q is a suboptimal one; then I'd expect LLM performance with prompt Q <= LLM performance with prompt P <= bigger-LLM performance with prompt Q.
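
Roughly, the sanity check I'm imagining (a minimal Python sketch; run_debate_eval(model, prompt) is a hypothetical helper standing in for the actual eval harness, returning judge accuracy on a fixed question set):

```python
def check_prompt_proxy(small_model, big_model, prompt_p, prompt_q, run_debate_eval):
    """Check the expected ordering: acc(small, Q) <= acc(small, P) <= acc(big, Q)."""
    acc_small_q = run_debate_eval(small_model, prompt_q)  # suboptimal prompt, smaller model
    acc_small_p = run_debate_eval(small_model, prompt_p)  # optimal prompt, smaller model
    acc_big_q = run_debate_eval(big_model, prompt_q)      # suboptimal prompt, bigger model
    return acc_small_q <= acc_small_p <= acc_big_q
```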

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Osaid Nasir · 1y

Oh, that's interesting. Wouldn't that slightly bias the results? For example, the paper claims no advantage of debate over QA without article. Intuitively, if the weak LLM isn't pretrained on QA without article, then debate should work better than consultancy. On the other hand, if it is, then intuitively there should be no difference between debate and consultancy, which is what the team observes. Wdyt?

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Osaid Nasir · 1y

Ah, that makes sense, thank you.
Did the team also ensure that there wasn't any data leakage between the tasks being evaluated and the pretraining data? For context, I'm thinking of replicating the results with Llama, so I'm wondering about the same thing.
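
Concretely, the kind of leakage probe I have in mind is a closed-book check like the sketch below (Python; answer_question(model, question, choices, article=None) is a hypothetical helper standing in for the actual QA call, and qa_items is a list of dicts with "question", "choices", and "answer" keys). Closed-book accuracy well above chance would suggest the questions or source articles leaked into pretraining.

```python
import random

def closed_book_probe(model, qa_items, answer_question, n_samples=200, seed=0):
    """Estimate closed-book accuracy: answer the questions without providing the article."""
    rng = random.Random(seed)
    sample = rng.sample(qa_items, min(n_samples, len(qa_items)))
    correct = 0
    for item in sample:
        pred = answer_question(model, item["question"], item["choices"], article=None)
        correct += int(pred == item["answer"])
    accuracy = correct / len(sample)
    chance = 1.0 / len(sample[0]["choices"])  # e.g. 0.25 for 4-way multiple choice
    return accuracy, chance
```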

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Osaid Nasir · 1y

My apologies, I didn't frame my question correctly.

Our current work is looking into training our LLM judges to be better proxies of human judges

My understanding of this statement is that the team plans to finetune Weak LLMs on human judgments and then use them as judges for Strong LLM Debates. This makes sense right now, while human judges are still able to assess Strong LLM Debates fairly robustly.

What happens when we want to use a Weak LLM as a judge but there is no accurate or good-enough human judge? At that point we won't be able to finetune the Weak LLM, since there is no good human judge to learn from. Do we assume that by that stage the Weak LLM itself will be pretty robust?

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Osaid Nasir · 1y

Our current work is looking into training our LLM judges to be better proxies of human judges

How does this scale to superintelligent AI capabilities? Wouldn't Debate be severely restricted by a lack of accurate human judges at that point? Or is the idea akin to weak-to-strong generalisation, wherein the human judge acts as the weak teacher at that point?
