Ollie J — LessWrong

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Ollie J2y20

Fixed, thanks for flagging

Meta "open sources" LMs competitive with Chinchilla, PaLM, and code-davinci-002 (Paper)

Ollie J3y10

The link to the GitHub repo is broken; it includes the comma at the end.

Human-level Diplomacy was my fire alarm

Ollie J3y92

I wonder how it would update its strategies if you negotiated in an unorthodox way:

"If you help me win, I will donate £5000 across various high-impact charities"
"If you don't help me win, I will kill somebody"

Contra Hofstadter on GPT-3 Nonsense

Ollie J4y290

The internet is littered with articles like this one, in which the author performs a surface-level analysis: ask GPT-3 some question (usually basic arithmetic), point at the wrong answer, and draw a sweeping conclusion ("GPT-3 is clueless"). They almost never state the parameters of the model used or give the whole input prompt.

GPT-3 is very capable of saying "I don't know" (or "yo be real"), but because of its training dataset it is unlikely to do so of its own accord.

GPT-3 is not an oracle or some other kind of agent. GPT-3 is a simulator of such agents. To get GPT-3 to act as a truthful oracle, the input prompt must explicitly instruct it to do so.
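
As a rough illustration of that point, here is a minimal sketch of a prompt that explicitly asks for truthful answers and gives the model a way to decline, bundled with the sampling parameters that articles tend to omit. The instruction wording, the example questions, the `Query` fields, and the model name are all assumptions made up for illustration; nothing here calls a real API.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical instruction and few-shot examples. The wording is an assumption,
# not a prompt taken from any article or from GPT-3 documentation.
INSTRUCTION = (
    "I answer questions truthfully. If a question is nonsense or I do not "
    'know the answer, I reply "yo be real".'
)
EXAMPLES = [
    ("What is the capital of France?", "Paris."),
    ("How many rainbows does it take to jump from Hawaii to seventeen?", "yo be real"),
]


@dataclass
class Query:
    """Everything a reader would need to reproduce a single completion."""
    model: str          # placeholder name, not a real model identifier
    temperature: float
    max_tokens: int
    prompt: str


def build_prompt(question: str) -> str:
    """Prepend the instruction and examples, then leave the answer open."""
    lines = [INSTRUCTION, ""]
    for q, a in EXAMPLES:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)


def make_query(question: str, model: str = "hypothetical-model",
               temperature: float = 0.0, max_tokens: int = 32) -> Query:
    """Bundle the full prompt with the sampling parameters used."""
    return Query(model, temperature, max_tokens, build_prompt(question))


if __name__ == "__main__":
    query = make_query("What do fried eggs eat for breakfast?")
    # Publishing this record alongside the model's answer makes the test reproducible.
    print(json.dumps(asdict(query), indent=2))
```

Publishing the whole `Query` record next to the model's answer is what lets a reader check whether a wrong answer reflects the model or just the prompt.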

Meta wants to use AI to write Wikipedia articles; I am Nervous™

Ollie J4y60

I'm positive that as these language models become more accessible and powerful, their misuse will grow massively. However, I believe open-sourcing is the best option here: having access to such models allows us to build accurate automatic classifiers that detect their outputs. Media websites (e.g. Wikipedia, Twitter) could include such a classifier in their pipeline for submitting new media.

Making such technologies closed source leaves researchers in the dark; due to the scaling-transformer hype, only a tiny fraction of the world's population has the financial means to train a SOTA transformer model.
