Maximilian Kaufmann

Comments

Sorted by
Newest
No wikitag contributions to display.
William_S's Shortform
Maximilian Kaufmann · 8mo

Based on your description I'd like to use this, but it's tough to trust / understand without a README!

Proposal: we should start referring to the risk from unaligned AI as a type of *accident risk*
Maximilian Kaufmann · 2y

Here's a point against that particular terminology which you might find interesting: https://www.lesswrong.com/posts/6bpW2kyeKaBtuJuEk/why-i-hate-the-accident-vs-misuse-ai-x-risk-dichotomy-quick

Basic Question about LLMs: how do they know what task to perform
Answer by Maximilian Kaufmann · Jan 14, 2023

To partially answer your question (I think the answer to "What is happening inside the LLM when it 'switches' to one task or another?" is pretty much "We don't know"): techniques such as RLHF, which are nowadays applied to pretty much any public-facing model you are likely to interact with, cause the model to act less like something searching for the most likely completion of a sentence from the internet, and more like something that is trying to answer your questions. These models would take the "question" interpretation over the "autocomplete" one.

A purely pretrained model would be more likely to output a probability distribution over the first token that is split between both possible continuations, and it would then stick with whichever interpretation it happened to sample as it autoregresses on its own output.
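As a rough illustration, here is a minimal sketch of how you could inspect that first-token distribution yourself. It assumes a local HuggingFace `transformers` + `torch` install and uses `gpt2` purely as a stand-in for a pretrained, non-RLHF'd model; the prompt is just an example of an input that has both a "do the task" and a "keep autocompleting the text" reading:

```python
# Minimal sketch: look at the next-token distribution of a purely pretrained
# model on an ambiguous prompt. Assumes `pip install transformers torch`;
# "gpt2" is only an illustrative pretrained (non-RLHF) model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# A prompt a pretrained model could either "answer" or simply continue.
prompt = "Translate the following sentence into French: 'I like cats.'"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the *first* generated token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(10)

for p, tok_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(tok_id))!r:>12}  {p.item():.3f}")
```

The point is that the probability mass can be split between tokens that start an actual translation and tokens that just continue the prompt as if it were scraped web text; whichever the model samples first, it then conditions on, locking in that interpretation for the rest of the generation.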

More examples of goal misgeneralization
Maximilian Kaufmann · 3y

How hard was it to find the examples of goal misgeneralization? Did the results take much “coaxing”?

Posts

Paper: LLMs trained on "A is B" fail to learn "B is A" (2y)

Paper: On measuring situational awareness in LLMs (2y)