KhromeM — LessWrong

Using an LLM perplexity filter to detect weight exfiltration

My last statement was totally wrong. Thanks for catching that.

In theory it's probably even possible to get the approximate weights by expending insane amounts of compute, but you could use those resources much more efficiently.

Using an LLM perplexity filter to detect weight exfiltration

I do not understand how you can extract weights just by conversing with an LLM, any more than you can get information on how my neurons are structured by conversing with me. Extracting training data it has seen is one thing, but presumably it has never seen its own weights. If the system prompts did not tell it that it was an LLM, it should not even be able to figure that out.