
David Johnston

Comments

Foom & Doom 1: “Brain in a box in a basement”
David Johnston · 13d

This piece combines relatively uncontroversial claims that come with some justification ("we're not near the compute or data efficiency limit") with controversial claims justified only by Steven's intuition ("the frontier will be reached suddenly by a small group few people are tracking"). I'd be more interested in a piece that either examined the consequences of the former kind of claim only, or justified the latter kind more strongly.

Interpretability Will Not Reliably Find Deceptive AI
David Johnston · 2mo

models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe

Only modest confidence, but while there's an observability gap between neuralese and CoT monitoring, I suspect it's smaller than the gap between reasoning traces that haven't been trained against oversight and reasoning traces that have.

Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt
David Johnston · 2mo
  1. I mean, even if you're mostly pursuing a particular set of final values (which is not what you're advocating here), there are probably strong reasons to make coordination a high priority (which is close to what you're advocating here).

  2. Well, I did say "to the extent permitted by 1" - there's probably conflict here - but I wasn't suggesting CEV as something that makes coordination easy. I'm saying it's a good principle for judging between the final outcomes of two different paths that have similar levels of coordination. Of course we'd have to estimate the "happiness in hindsight", but this looks tractable to me.

$500 Bounty Problem: Are (Approximately) Deterministic Natural Latents All You Need?
David Johnston · 2mo

I've thought about it a bit and have a line of attack for a proof, but there's too much work involved in following it through to an actual proof, so I'm going to leave it here in case it helps anyone.

I'm assuming everything is discrete so I can work with regular Shannon entropy.

Consider the range $R_1$ of the function $g_1 : \lambda \mapsto P(X_1 \mid \Lambda = \lambda)$, and $R_2$ defined similarly. Discretize $R_1$ and $R_2$ (chop them up into little balls). Not sure which metric to use, maybe TV.

Define $\Lambda'_1(\lambda)$ to be the index of the ball into which $P(X_1 \mid \Lambda = \lambda)$ falls, and $\Lambda'_2$ similarly. So if $d(P(X_1 \mid \Lambda = a), P(X_1 \mid \Lambda = b))$ is sufficiently small, then $\Lambda'_1(a) = \Lambda'_1(b)$.
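
To make the construction concrete, here's a minimal sketch in symbols, assuming the TV metric and a partition of $R_1$ into cells of diameter at most some $\varepsilon > 0$ (the cells $B_1, B_2, \dots$ and the radius $\varepsilon$ are my notation, not part of the original problem):

$$R_1 = \{\, P(X_1 \mid \Lambda = \lambda) : \lambda \in \operatorname{supp}(\Lambda) \,\} \subseteq \bigcup_j B_j, \qquad \operatorname{diam}_{TV}(B_j) \le \varepsilon,$$

$$\Lambda'_1(\lambda) = j \iff P(X_1 \mid \Lambda = \lambda) \in B_j,$$

so whenever $\Lambda'_1(a) = \Lambda'_1(b)$ we get $d_{TV}\big(P(X_1 \mid \Lambda = a),\, P(X_1 \mid \Lambda = b)\big) \le \varepsilon$.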

By the data processing inequality, conditions 2 and 3 still hold for $\Lambda' = (\Lambda'_1, \Lambda'_2)$. Condition 1 should hold with some extra slack depending on the coarseness of the discretization.

It takes a few steps, but I think you might be able to argue that, with high probability, for each $X_2 = x_2$, the random variable $Q_1 := P(X_1 \mid \Lambda'_1)$ will be highly concentrated (n.b. I've only worked it through fully in the exact case, and I think it can be translated to the approximate case, but I haven't checked). We then invoke the discretization to argue that $H(\Lambda'_1 \mid X_1)$ is bounded. The intuition is that the discretization forces nearby probabilities to coincide, so if $Q_1$ is concentrated then it actually has to "collapse" most of its mass onto a few discrete values.
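
One way the "concentration implies bounded entropy" step could be made quantitative (my own gloss, not part of the original sketch) is a Fano-style bound: if, conditional on the relevant observation, $\Lambda'_1$ lands in a single cell with probability at least $1 - \delta$, then

$$H(\Lambda'_1 \mid X_1 = x_1) \le H_b(\delta) + \delta \log\big(|\mathcal{R}| - 1\big),$$

where $H_b$ is the binary entropy function and $|\mathcal{R}|$ is the number of cells in the discretization.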

We can then make a similar argument, switching the indices, to get $H(\Lambda'_2 \mid X_2)$ bounded. Finally, maybe applying conditions 2 and 3 we can get $H(\Lambda'_1 \mid X_2)$ bounded as well, which then gives a bound on $H(\Lambda' \mid X_i)$.
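
For bookkeeping (again my own addition), the way these pieces would combine is just subadditivity of conditional entropy for the joint latent $\Lambda' = (\Lambda'_1, \Lambda'_2)$:

$$H(\Lambda' \mid X_i) \le H(\Lambda'_1 \mid X_i) + H(\Lambda'_2 \mid X_i), \qquad i = 1, 2,$$

so bounding each single-index term (presumably $H(\Lambda'_2 \mid X_1)$ follows by the same symmetry as $H(\Lambda'_1 \mid X_2)$) bounds the whole thing.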

I did try feeding this to Gemini but it wasn't able to produce a proof.

$500 Bounty Problem: Are (Approximately) Deterministic Natural Latents All You Need?
David Johnston · 2mo

Wait, I thought the first property was just independence, not also identically distributed.

In principle I could have, e.g., two biased coins whose biases are different but deterministically dependent on each other.
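
A minimal concrete instance of the kind of thing I have in mind (the specific choice of biases is just for illustration):

$$X_1 \mid \Lambda = \lambda \sim \mathrm{Bernoulli}(\lambda), \qquad X_2 \mid \Lambda = \lambda \sim \mathrm{Bernoulli}(1 - \lambda),$$

with $X_1 \perp X_2 \mid \Lambda$: the two coins are conditionally independent given $\Lambda$, but not identically distributed unless $\lambda = 1/2$.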

Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt
David Johnston · 2mo

I think:

  1. Finding principles for AI "behavioural engineering" that reduce people's desire to engage in risky races (e.g. because they find the principles acceptable) seems highly valuable
  2. To the extent permitted by 1, pursuing something CEV-like ("we're happier with the outcome in hindsight than we would've been with other outcomes") also seems desirable

I sort of see the former as potentially encouraging diversity (because different groups want different things, and are most likely to agree to "everyone gets what they want"), but the latter may in fact suggest convergence (because, perhaps, there are fairly universal answers to "what makes people happy with the benefit of hindsight?").

You stress the importance of having robust feedback procedures, but having overall goals like this can help to judge which procedures are actually doing what we want.

$500 Bounty Problem: Are (Approximately) Deterministic Natural Latents All You Need?
David Johnston · 3mo

Your natural latents seem to be quite closely related to the common construction of IID variables conditional on a latent; in fact, all of your examples are IID variables (or "bundles" of IID variables) conditional on that latent. Can you give me an interesting example of a natural latent that is not basically the conditionally IID case?

(I was wondering if the extensive literature on the correspondence between de Finetti-type symmetries and conditional IID representations is of any help to your problem. I'm not entirely sure it is, given that it mostly addresses the issue of getting from a symmetry to a conditional independence, whereas you want to get from one conditional independence to another, but it's plausible that some of the methods are applicable.)
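
For reference, the result I have in mind is de Finetti's representation theorem: an infinite exchangeable sequence $X_1, X_2, \dots$ is distributed as a mixture of IID sequences, i.e. there is a latent $\Theta$ such that

$$P(X_1, \dots, X_n) = \int \prod_{i=1}^{n} P(X_i \mid \Theta = \theta)\, d\mu(\theta),$$

with the $X_i$ IID conditional on $\Theta$.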

RL, but don't do anything I wouldn't do
David Johnston · 7mo

If you're in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen "normal behaviour" to normal behaviour in your situation. Reinforcement learning is limited - you can't always extrapolate past reward - but it's not obvious that imitative regularisation is fundamentally more limited.

(normal does not imply safe, of course)

RL, but don't do anything I wouldn't do
David Johnston · 7mo

Their empirical result rhymes with adversarial robustness issues: we can train adversaries to maximise ~arbitrary functions subject to a constraint of small perturbation from the ground truth. Here the maximised function is a faulty reward model, and the constraint is KL to a base model instead of distance to a ground-truth image.

I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that the generations look normal at any "scale", whether we look at them token by token or read a high-level summary of them. However, I suspect their "weird, low-KL" generations will have weird high-level summaries, whereas more desirable policies would look more normal in summary (though it's not immediately obvious whether this translates to low- and high-probability summaries respectively; one would need to test). I think a KL penalty to the "true base policy" should operate this way automatically, but, as the authors note, we can't actually implement that.
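
To make the idea concrete, here's a rough sketch of what a multiscale KL penalty might look like, in PyTorch. Everything here is hypothetical: the function name, the weighting, and especially the assumption that we have some summariser producing logits for a high-level summary of the generation under both the policy and the base model.

```python
import torch
import torch.nn.functional as F

def multiscale_kl_penalty(policy_token_logits, base_token_logits,
                          policy_summary_logits, base_summary_logits,
                          token_weight=1.0, summary_weight=1.0):
    """Hypothetical penalty that keeps generations close to the base model
    at two scales: token by token, and at the level of a high-level summary.

    All arguments are logits over the relevant vocabulary; the summary logits
    are assumed to come from some summarisation process applied to the
    generation (not specified here).
    """
    # KL(policy || base) at the token level.
    token_kl = F.kl_div(
        F.log_softmax(base_token_logits, dim=-1),    # input: log base probs
        F.log_softmax(policy_token_logits, dim=-1),  # target: log policy probs
        log_target=True, reduction="batchmean",
    )
    # KL(policy || base) at the summary level.
    summary_kl = F.kl_div(
        F.log_softmax(base_summary_logits, dim=-1),
        F.log_softmax(policy_summary_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    return token_weight * token_kl + summary_weight * summary_kl
```

The point of the second term is just that a policy that stays close to the base model token by token can still drift into generations whose summaries look nothing like the base model's; penalising at both scales is the analogue of multiscale aggregation in the image case.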

Model Integrity: MAI on Value Alignment
David Johnston · 7mo

Is your view closer to:

  • there are two hard steps (instruction following, value alignment), and of the two instruction following is much more pressing
  • instruction following is the only hard step; if you get that, value alignment is almost certain to follow
Posts

7 · A brief theory of why we think things are good or bad · 9mo · 10
11 · Mechanistic Anomaly Detection Research Update · 1y · 0
6 · Opinion merging for AI control · 2y · 0
11 · [Question] Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs? · 2y · 6
-1 · How likely are malign priors over objectives? [aborted WIP] · 3y · 0
8 · When can a mimic surprise you? Why generative models handle seemingly ill-posed problems · 3y · 4
3 · There's probably a tradeoff between AI capability and safety, and we should act like it · 3y · 3
3 · Is evolutionary influence the mesa objective that we're interested in? · 3y · 2
2 · [Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness · 3y · 0
5 · [Question] Are there any impossibility theorems for strong and safe AI? · 3y · 3