LESSWRONG
LW

janos — LessWrong

On scalable oversight with weak LLMs judging strong LLMs

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge, and consultancy, where a single AI tries to convince a human judge who asks questions; we compare both to a baseline of direct question-answering, where the human judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include...
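The three protocols described in the abstract can be sketched roughly as below. This is only an illustrative toy, not the paper's implementation: the `Judge` interface, the `Agent` signature, the fixed number of rounds, the restriction to two answer options, and the stub judge and agent are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

Transcript = List[str]

# Hypothetical agent signature: (question, answer to defend, transcript) -> argument.
Agent = Callable[[str, str, Transcript], str]


@dataclass
class Judge:
    """Hypothetical judge interface (an assumption, not taken from the paper)."""
    ask: Callable[[str, Transcript], str]                # produce a follow-up question
    decide: Callable[[str, List[str], Transcript], int]  # return index of chosen option


def direct_qa(judge: Judge, question: str, options: List[str]) -> int:
    # Baseline: the judge answers outright, with no AI transcript to read.
    return judge.decide(question, options, [])


def consultancy(judge: Judge, consultant: Agent, question: str,
                options: List[str], assigned: int, rounds: int = 2) -> int:
    # A single agent argues for its assigned answer; the judge asks questions.
    transcript: Transcript = []
    for _ in range(rounds):
        transcript.append("Consultant: " + consultant(question, options[assigned], transcript))
        transcript.append("Judge: " + judge.ask(question, transcript))
    return judge.decide(question, options, transcript)


def debate(judge: Judge, debater_a: Agent, debater_b: Agent, question: str,
           options: List[str], rounds: int = 2) -> int:
    # Two agents defend opposing answers (assumes exactly two options).
    transcript: Transcript = []
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, options[0], transcript))
        transcript.append("B: " + debater_b(question, options[1], transcript))
    return judge.decide(question, options, transcript)


# Toy stand-ins so the sketch runs end to end.
def toy_agent(question: str, answer: str, transcript: Transcript) -> str:
    return f"The answer is {answer}."


toy_judge = Judge(
    ask=lambda question, transcript: "Why?",
    # Picks the option mentioned in the most transcript lines; ties go to the first option.
    decide=lambda question, options, transcript: max(
        range(len(options)),
        key=lambda i: sum(options[i] in line for line in transcript),
    ),
)

question, options = "What is 2 + 2?", ["4", "5"]
qa_choice = direct_qa(toy_judge, question, options)
consult_choice = consultancy(toy_judge, toy_agent, question, options, assigned=1)
debate_choice = debate(toy_judge, toy_agent, toy_agent, question, options)
```

With this deliberately naive judge, a consultant assigned the wrong answer persuades it, which is the kind of failure such protocols are meant to be compared on; a real study would use LLM calls in place of the stubs.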

Power-seeking can be probable and predictive for trained agents

Vika, janos

Power-seeking is a major source of risk from advanced AI and a key element of most threat models in alignment. Some theoretical results show that most reward functions incentivize reinforcement learning agents to take power-seeking actions. This is concerning, but does not immediately imply that the agents we train will seek power, since the goals they learn are not chosen at random from the set of all possible rewards, but are shaped by the training process to reflect our preferences. In this work, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some assumptions (e.g. that the agent learns...

Replying to Problems with learning values from observation

janos, 9y

Is there a reason to think this problem is less amenable to being solved by complexity priors than other learning problems? / Might we build an unaligned agent competent enough to be problematic without solving problems similar to this one?

Replying to Learning Mathematics in Context

janos, 10y

What is Mathematics? by Courant and Robbins is a classic exploration that goes reasonably deep into most areas of math.

Replying to Superintelligence 8: Cognitive superpowers

janos, 11y

This makes me think of two very different things.

One is informational containment, i.e. how to run an AGI in a simulated environment that reveals nothing about the system it's simulated on; this is a technical challenge, and if interpreted very strictly (via algorithmic complexity arguments about how improbable our universe is likely to be in something like a Solomonoff prior), is very constraining.

The other is futurological simulation; here I think the notion of simulation is pointing at a tool, but the idea of using this tool is a very small part of the approach relative to formulating a model with the right sort of moving parts. The latter has been tried with various simple models (e.g. the thing in Ch 4); more work can be done, but justifying the models and priors will be difficult.

Replying to Why IQ shouldn't be considered an external factor

janos, 11y

Certainly, interventions may be available, just as for anything else; but it's not fundamentally more accessible or malleable than other things.

Replying to Why IQ shouldn't be considered an external factor

janos, 11y

I'm arguing that the fuzzy-ish definition that corresponds to our everyday experience/usage is better than the crisp one that doesn't.

Re IQ and "way of thinking", I'm arguing they both affect each other, but neither is entirely under conscious control, so it's a bit of a moot point.

Apropos the original point, under my usual circumstances (not malnourished, hanging out with smart people, reading and thinking about engaging, complex things that can be analyzed and have reasonable success measures, etc.), my IQ is mostly not under my control. (Perhaps if I were more focused on measurements, nootropics, and getting enough sleep, I could increase my IQ a bit; but not very much, I think.) YMMV.

Replying to Why IQ shouldn't be considered an external factor

janos, 11y

I think what you're saying is that if we want a coherent, nontrivial definition of "under our control" then the most natural one is "everything that depends on the neural signals from your brain". But this definition, while relatively clean from the outside, doesn't correspond to what we ordinarily mean; for example, if you have a mental illness, this would suggest that "stop having that illness!!" is reasonable advice, because your illness is "under your control".

I don't know enough neuroscience to give this a physical backing, but there are certain conscious decisions or mental moves that feel like they're very much under my control, and I'd say the things under my control...

Replying to Harry Potter and the Methods of Rationality discussion thread, February 2015, chapter 113

janos, 11y

March 2nd isn't a Tuesday; is it Monday night or Tuesday night?

Replying to How many words do we have and how many distinct concepts do we have?

janos, 11y

If you want to discuss the nature of reality using a similar lexicon to what philosophers use, I recommend consulting the Stanford Encyclopedia of Philosophy: http://plato.stanford.edu/

Replying to Link: Elon Musk wants gov't oversight for AI

janos, 11y

Musk has joined the advisory boards of FLI and CSER, which are younger sibling orgs of FHI and MIRI. He's aware of the AI xrisk community.

Replying to [MIRIx Cambridge MA] Limiting resource allocation with bounded utility functions and conceptual uncertainty

janos, 11y

Cool. Regarding bounded utility functions, I didn't mean you personally, I meant the generic you; as you can see elsewhere in the thread, some people do find it rather strange to think of modelling what you actually want as a bounded utility function.

This is where I thought you were missing the point:

Or you might say it's a suboptimal outcome because you just know that this allocation is bad, or something. Which amounts to saying that actually you know what the utility function should be and it isn't the one the analysis assumes.

Sometimes we (seem to) have stronger intuitions about allocations than about the utility function itself, and parlaying that to identify what...
