Comparative Advantage is Not About Trade

That seems like a sensible way to set up the no-trade situation. Presumably the connection to trade is via some theorem that trade will result in Pareto-optimal situations, therefore making comparative advantage applicable.

But I still wonder what the exact theorem is.

Then if you want to describe the Pareto frontier that maximizes the amount of goods produced, it involves each person producing a good where they have a favorable ratio of how much of that good they can produce vs. how much of other goods-being-produced they can produce.

What do you mean by "favorable"? Is there some threshold?

What do you mean by "involves each person producing"? Does it mean that they'll exclusively produce such goods? Or does it mean they'll produce at least some of such goods?
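To make my question concrete, here is the kind of opportunity-cost ratio I understand the concept to involve, in a toy two-person, two-good case (all names and numbers are made up):

```python
# Units each person can produce per day, working full-time on one good
# (numbers invented for illustration):
rates = {
    "alice": {"bananas": 10, "coconuts": 5},
    "bob":   {"bananas": 4,  "coconuts": 4},
}

def opportunity_cost(person, good, other_good):
    """Units of other_good forgone per unit of good produced."""
    r = rates[person]
    return r[other_good] / r[good]

# Alice forgoes 0.5 coconuts per banana; Bob forgoes 1. So Alice has the
# comparative advantage in bananas: a strictly lower opportunity cost.
assert opportunity_cost("alice", "bananas", "coconuts") < \
       opportunity_cost("bob", "bananas", "coconuts")
```

With two goods and two people, the "favorable ratio" criterion is unambiguous; my question is what the analogous statement is with many goods and many producers.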

Don't Get Distracted by the Boilerplate

Correction: I now see that my formulation turns the question of completeness into a question of transitivity of indifference. An "incomplete" preference relation should not be understood as one that allows strict preferences to go in both directions (which is how I interpreted them above), but rather as a preference relation in which the ≤ relation (and hence the ~ relation) is not transitive.

In this case, we can distinguish between ~ and "gaps", IE, incomparable A and B. ~ might be transitive, but this doesn't bridge across the gaps. So we might have a preference chain A>B>C and a chain X>Y>Z, but not have any way to compare between the two chains.

In my formulation, which lumps together indifference and gaps, we can't have this two-chain situation. If A~X, then we must have A>Y, since X>Y, by transitivity of ≤: if instead A≤Y, then X≤A≤Y would give X≤Y, contradicting X>Y.

So what would be a completeness violation in the Wikipedia formulation becomes a transitivity violation in mine.

But notice that I never argued for the transitivity of ~ or ≤ in my comment; I only argued for the transitivity of >.

I don't think a money-pump argument can be offered for transitivity here.

However, I took a look at the paper by Aumann which you cited, and I'm fairly happy with the generalization of VNM therein! Dropping uniqueness does not seem like a big cost. This seems like more of an example of John Wentworth's "boilerplate" point, rather than a counterexample.

Comparative Advantage is Not About Trade

This was helpful, but I'm still somewhat confused. Conspicuously absent from your post is an outright statement of what comparative advantage is -- particularly, what the concept and theorem is supposed to be in the general case with more than two resources and more than two agents.

The question is: who and where do I order to grow bananas, and who and where do I order to build things? To maximize construction, I will want to order people with the largest comparative advantage in banana-growing to specialize in banana-growing, and I will want to order those bananas to be grown on the islands with the largest comparative advantage in banana-growing. (In fact, this is not just relevant to maximization of construction - it applies to Pareto-optimal production in general.)

Could you elaborate on this by providing the general statement rather than only the example?

Before reading your post, I had in mind two different uses for the concept:

  • Comparative advantage is often used as an argument for free trade. Dynomight's post seems to provide a sufficient counterargument to this, in its example illustrating how with more than 2 players, opening up a trade route may not be a Pareto improvement (may not be a good thing for everyone).
  • Comparative advantage is sometimes used in career advice, EG, "find your comparative advantage". This is the case I focus on in the comment I linked to illustrating my confusion. What advice is actually offered? Are agents supposed to produce and sell things which they have a comparative advantage in? Not so much. It seems that advice coming from the concept is actually extremely weak in the case of a market with more than two goods.

Your post gave me a third potential application, namely, a criterion for when trade may occur at all. This expanded my understanding of the concept considerably. It's clear that where no comparative advantage exists, no trade makes sense. A country that's bad at producing everything might want to buy stuff from a country that's just 10x better, but to do so they'd at least need a comparative advantage in producing money (which doesn't really make sense; money isn't something you produce). (Or putting it a different way: their money would soon be used up.)

But then you apply the concept of comparative advantage to a case where there isn't any trade at all. What would you give as your general statement of the concept and the theorem you're applying?

Don't Get Distracted by the Boilerplate

I happened upon this old thread, and found the discussion intriguing. Thanks for posting these references! Unless I'm mistaken, it sounds like you've discussed this topic a lot on LW but have never made a big post detailing your whole perspective. Maybe that would be useful! At least I personally find discussions of applicability/generalizability of VNM and other rationality axioms quite interesting.

Indeed, I think I recently ran into another old comment of yours in which you made a remark about how Dutch Books only hold for repeated games? I don't recall the details now.

I have some comments on the preceding discussion. You said:

It would be rather audacious to claim that this is true for each of the four axioms. For instance, do please demonstrate how you would Dutch-book an agent that does not conform to the completeness axiom!

For me, it seems that transitivity and completeness are on an equally justified footing, based on the classic money-pump argument.

Just to keep things clear, here is how I think about the details. There are outcomes. Then there are gambles, which we will define recursively. An outcome counts as a gamble for the sake of the base case of our recursion. For gambles A and B, pA+(1-p)B also counts as a gamble, where p is a real number in the range [0,1].

Now we have a preference relation > on our gambles. I understand its negation to be ≤: saying A≤B is just the same thing as ¬(A>B). The indifference relation ~ is then ≤ in both directions: A~B is the same thing as A≤B and B≤A.

This is different from the development on Wikipedia, where ~ is defined separately. But I think it makes more sense to define > and then define ~ from that. A>B can be understood as "definitely choose A when given the choice between A and B". ~ then represents indifference as well as uncertainty like the kind you describe when you discuss bounded rationality.

From this starting point, it's clear that either A<B, or B<A, or A~B. This is just a way of saying "either A<B or B<A or neither". What's important about the completeness axiom is the assumption that exactly one of these holds; this tells us that we cannot have both A<B and B<A.

But this is practically the same as circular preferences A<B<C<A, which transitivity outlaws. It's just a circle of length 2.

The classic money-pump against circularity is that if we have circular preferences, someone can charge us for making a round trip around the circle, swapping A for B for C for A again. They leave us in the same position we started, less some money. They can then do this again and again, "pumping" all the money out of us.
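The pump is mechanical enough to write down directly; the labels, fee, and starting money below are all illustrative:

```python
# Circular preferences: B over A, C over B, A over C. The agent pays a
# small fee for any swap to something it strictly prefers.

prefers = {("B", "A"), ("C", "B"), ("A", "C")}   # (x, y): x preferred to y

def accepts_swap(offered, held):
    return (offered, held) in prefers

money, held, fee = 100.0, "A", 1.0
for offered in ["B", "C", "A"]:       # one lap around the circle
    if accepts_swap(offered, held):
        held = offered
        money -= fee

# One lap later: same holding, less money. Iterate to pump the agent dry.
assert held == "A" and money == 97.0
```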

Personally, I find this argument extremely metaphysically weird, for several reasons.

  • The money-pumper must be God, to be able to swap arbitrary A for B, and B for C, etc.
  • But furthermore, the agent must not understand the true nature of the money-pumper. When God asks about swapping A for B, the agent thinks it'll get B in the end, and makes the decision accordingly. Yet, God proceeds to then ask a new question, offering to swap B for C. So God doesn't actually put the agent in universe B; rather, God puts the agent in "B+God", a universe with the possibility of B, but also a new offer from God, namely, to move on to C. So God is actually fooling the agent, making an offer of B but really giving the agent something different from B. Bad decision-making should not count against the agent if the agent was misled in such a manner!
  • It's also pretty weird that we can end up "in the same situation, but with less money". If the outcomes A,B,C were capturing everything about the situation, they'd include how much money we had!

I have similar (but less severe) objections to Dutch-book arguments.

However, I also find the argument extremely practically applicable, so much so that I can excuse the metaphysical weirdness. I have come to think of Dutch-book and money-pump arguments as illustrative of important types of (in)consistency rather than literal arguments.

OK, why do I find money-pumps practical?

Simply put, if I have a loop in my preferences, then I will waste a lot of time deliberating. The real money-pump isn't someone taking advantage of me, but rather, time itself passing.

What I find is that I get stuck deliberating until I can find a way to get rid of the loop. Or, if I "just choose randomly", I'm stuck with a yucky dissatisfied feeling (I have regret, because I see another option as better than the one I chose).

This is equally true of three-choice loops and two-choice loops. So, transitivity and completeness seem equally well-justified to me.

Stuart Armstrong argues that there is a weak money pump for the independence axiom. I made a very technical post (not all of which seems to render correctly on LessWrong :/) justifying as much as I could with money-pump/dutch-book arguments, and similarly got everything except continuity. 

I regard continuity as not very theoretically important, but highly applicable in practice. IE, I think the pure theory of rationality should exclude continuity, but a realistic agent will usually have continuous values. The reason for this is again because of deliberation time.

If we drop continuity, we get a version of utility theory with infinite and infinitesimal values. This is perfectly fine, has the advantage of being more general, and is in some sense more elegant. To reference the OP, continuity is definitely just boilerplate; we get a nice generalization if we want to drop it.

However, a real agent will ignore its own infinitesimal preferences, because it's not worth spending time thinking about that. Indeed, it will almost always just think about the largest infinity in its preferences. This is especially true if we assume that the agent places positive probability on a really broad class of things, which again seems true of capable agents in practice. (IE, if you have infinities in your values, and a broad probability distribution, you'll be Pascal-mugged -- you'll only think of the infinite payoffs, neglecting finite payoffs.)
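One way to model the dropped-continuity case is lexicographic utility, which Python's tuple ordering captures directly; the "ignore everything below the largest infinity" behavior falls out immediately (outcomes and numbers below are invented):

```python
# Utilities with "tiers": the top (infinite) tier always dominates; the
# finite tier only breaks ties. Python compares tuples lexicographically,
# so tuples model this directly.

def utility(outcome):
    tiers = {
        "tiny chance of infinite payoff": (1, -100),     # nonzero top tier
        "certain large finite payoff":    (0, 10**9),    # finite tier only
    }
    return tiers[outcome]

# Any nonzero weight on the top tier beats every finite consideration --
# the Pascal's-mugging behavior described above.
best = max(["tiny chance of infinite payoff", "certain large finite payoff"],
           key=utility)
assert best == "tiny chance of infinite payoff"
```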

So all of the axioms except independence have what appear to me to be rather practical justifications, and independence has a weak money-pump justification (which may or may not translate to anything practical).

Comparative advantage and when to blow up your island

I once had a discussion with Scott G and Eli Tyre about this. We decided that the "real thing" was basically where you should end up in the complicated worker/job optimization problem, and there were more or less two ways to try and approximate it:

  1. Supposing everyone else has already chosen their optimal spot, what still needs doing? What can I best contribute? This is sorta easy, because you just look around at what needs doing, combine this with what you know about how capable you are at contributing, and you get an estimate of how much you'd contribute in each place. Then you go to the place with the highest number. [modulo gut feelings, intrinsic motivation, etc]
  2. Supposing you choose first, how could everyone else move around you to create an optimal configuration? You then go do the thing which implies the best configuration. This seems much harder, but might be necessary for people who provide a lot of value (and therefore what they do has a big influence on what other people should do), particularly in small teams where a near-optimal reaction to your choice is feasible.
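A toy version of the underlying assignment problem shows why these two approximations can disagree; the workers, jobs, and values below are invented:

```python
from itertools import permutations

# value[worker][job]: how much that worker contributes in that job.
value = {
    "you":  {"research": 5, "ops": 4},
    "ally": {"research": 4, "ops": 1},
}
jobs = ["research", "ops"]

def total(assignment):
    return sum(value[w][j] for w, j in assignment.items())

# The "real thing": the best one-to-one assignment of workers to jobs.
best = max(
    ({w: j for w, j in zip(value, perm)} for perm in permutations(jobs)),
    key=total,
)

# Picking your personally best job in isolation sends "you" to research
# (5 > 4), but the optimum puts you in ops, because "ally" is nearly
# useless outside research -- your choice has to account for how others
# can rearrange around you.
assert best == {"you": "ops", "ally": "research"}
```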
Comparative advantage and when to blow up your island

OK. It seems there are results for more than 2 goods, but the results are quite weak:

Thus, if both relative prices are below the relative prices in autarky, we can rule out the possibility that both goods 1 and 2 will be imported—but we cannot rule out the possibility that one of them will be imported. In other words, once we leave the two-good case, we cannot establish detailed predictive relations saying that if the relative price of a traded good exceeds the relative price of that good in autarky, then that good will be exported by the country in question. It follows that any search for a strong theorem along the lines of our first proposition earlier is bound to fail. The most one can hope for is a correlation between the pattern of trade and differences in autarky prices. 

Dixit, Avinash; Norman, Victor (1980). Theory of International Trade: A Dual, General Equilibrium Approach. Cambridge: Cambridge University Press. p. 8

Comparative advantage and when to blow up your island

Here's something I don't get about comparative advantage.

The implied advice, as far as I understand it, is to check which good you have a comparative advantage in producing, and offer that good to the market.

But suppose that there are a lot more goods and a lot more participants in the market.

For any one individual, given fixed prices and supply of everyone else, it sounds like we can formulate the production and trade strategy as a linear programming problem:

  • We have some maximum amount of time. That's a linear constraint.
  • We can allocate time to different tasks.
  • The output of the tasks are assumed to be linear in time.
  • The tasks produce different goods.
  • These goods all have different prices on the market.
  • We might have some basic needs, like the 10 bananas and 10 coconuts. That's a constraint.
  • We might also have desires, like not working, or we might desire some goods. That's our linear programming objective.

OK. So we can solve this as a linear program.

But... linear programs don't have some nice closed-form solution. The simplex algorithm can solve them efficiently in practice, but that's very different from an easy formula like "produce the good with the highest comparative advantage".
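To illustrate, here is the single-agent version as an explicit optimization, with invented rates, prices, and needs; a coarse grid search stands in for a real LP solver, but the point is the same: the answer falls out of optimization, not out of a one-line comparative-advantage rule.

```python
T = 10.0                                     # hours available
rate = {"bananas": 5.0, "coconuts": 4.0}     # units produced per hour
price = {"bananas": 1.0, "coconuts": 2.0}    # market prices
need = {"bananas": 10.0, "coconuts": 10.0}   # subsistence requirements

def revenue(t_bananas):
    """Money from selling surplus, given hours allocated to bananas."""
    t_coconuts = T - t_bananas
    surplus_b = rate["bananas"] * t_bananas - need["bananas"]
    surplus_c = rate["coconuts"] * t_coconuts - need["coconuts"]
    if surplus_b < 0 or surplus_c < 0:
        return float("-inf")                 # infeasible: needs unmet
    return price["bananas"] * surplus_b + price["coconuts"] * surplus_c

# Coarse search over time splits (a stand-in for the simplex method).
best_t = max((i / 100 * T for i in range(101)), key=revenue)

# Coconuts earn 8/hour vs 5/hour for bananas, so the optimum spends only
# the minimum feasible time (2 hours) on bananas.
assert abs(best_t - 2.0) < 1e-9
```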

And that's just solving the problem for one player, assuming the other players have fixed strategies. More generally, we have to anticipate the rest of the market as well. I don't even know if that can be solved efficiently, via linear programming or some other technique.

Is "produce where you have comparative advantage" really very useful advice for more complex cases?

Wikipedia starts out describing comparative advantage as a law:

The law of comparative advantage describes how, under free trade, an agent will produce more of and consume less of a good for which they have a comparative advantage.[1]

But no precise mathematical law is ever stated, and the law is only justified with examples (specifically, two-player, two-commodity examples). Furthermore, I only ever recall seeing comparative advantage explained with examples, rather than being stated as a theorem. (Although this may be because I never got past econ 101.)

This makes it hard to know what the claimed law even is, precisely. "produce more and consume less"? In comparison to what?

One spot on Wikipedia says:

Skeptics of comparative advantage have underlined that its theoretical implications hardly hold when applied to individual commodities or pairs of commodities in a world of multiple commodities.

This is stated without citation, though, so I don't know where to find the details of these critiques.

Comparing Utilities

I'm not sure I follow that it has to be linear - I suspect higher-order polynomials will work just as well. Even if linear, there are a very wide range of transformation matrices that can be reasonably chosen, all of which are compatible with not blocking Pareto improvements and still not agreeing on most tradeoffs.

Well, I haven't actually given the argument that it has to be linear. I've just asserted that there is one, referencing Harsanyi and complete class arguments. There are a variety of related arguments. And these arguments have some assumptions which I haven't been emphasizing in our discussion.

Here's a pretty strong argument (with correspondingly strong assumptions).

  1. Suppose each individual is VNM-rational.
  2. Suppose the social choice function is VNM-rational.
  3. Suppose that we also can use mixed actions, randomizing in a way which is independent of everything else.
  4. Suppose that the social choice function has a strict preference for every Pareto improvement.
  5. Also suppose that the social choice function is indifferent between two different actions if every single individual is indifferent.
  6. Also suppose the situation gives a nontrivial choice with respect to every individual; that is, no one is indifferent between all the options.

By VNM, each individual's preferences can be represented by a utility function, as can the preferences of the social choice function.

Imagine actions as points in preference-space, an n-dimensional space where n is the number of individuals.

By assumption #5, actions which map to the same point in preference-space must be treated the same by the social choice function. So we can now imagine the social choice function as a map from R^n to R.

VNM on individuals implies that the mixed action p * a1 + (1-p) * a2 is just the point p of the way on a line between a1 and a2.

VNM implies that the value the social choice function places on mixed actions is just a linear mixture of the values of pure actions. But this means the social choice function can be seen as an affine function from R^n to R. Of course since utility functions don't mind additive constants, we can subtract the value at the origin to get a linear function.

But remember that points in this space are just vectors of individuals' utilities for an action. So that means the social choice function can be represented as a linear function of individuals' utilities.

So now we've got a linear function. But I haven't used the Pareto assumption yet! That assumption, together with #6, implies that the linear function has to be increasing in every individual's utility function.
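The structure of the argument can be sanity-checked numerically on a toy case (the weights and utility vectors below are invented):

```python
# If social utility U is a weighted sum of individual utilities, it
# respects mixtures (VNM) automatically and, with positive weights,
# strictly prefers Pareto improvements.

weights = [0.7, 0.3]                  # positive => Pareto-respecting

def social(point):
    """Social utility of a point in preference-space (one coordinate per person)."""
    return sum(w * u for w, u in zip(weights, point))

a1 = (1.0, 4.0)
a2 = (3.0, 0.0)
p = 0.25
mixture = tuple(p * x + (1 - p) * y for x, y in zip(a1, a2))

# Linearity: the value of the mixed action is the mixture of the values.
assert abs(social(mixture) - (p * social(a1) + (1 - p) * social(a2))) < 1e-12

# Pareto: making one person better off (others unchanged) raises U.
better = (a1[0] + 1.0, a1[1])
assert social(better) > social(a1)
```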

Now I'm lost again. "you should have a preference over something where you have no preference" is nonsense, isn't it? Either the someone in question has a utility function which includes terms for (their beliefs about) other agents' preferences (that is, they have a social choice function as part of their preferences), in which case the change will ALREADY BE positive for their utility, or that's already factored in and that's why it nets to neutral for the agent, and the argument is moot.


If you're just saying "people don't understand their own utility functions very well, and this is an intuition pump to help them see this aspect", that's fine, but "theorem" implies something deeper than that.

Indeed, that's what I'm saying. I'm trying to separately explain the formal argument, which assumes the social choice function (or individual) is already on board with Pareto improvements, and the informal argument to try to get someone to accept some form of preference utilitarianism, in which you might point out that Pareto improvements benefit others at no cost (a contradictory and pointless argument if the person already has fully consistent preferences, but an argument which might realistically sway somebody from believing that they can be indifferent about a Pareto improvement to believing that they have a strict preference in favor of them).

But the informal argument relies on the formal argument.

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Maybe it's better phrased as "a CIRL agent has a positive incentive to allow shutdown iff it's uncertain [or the human has a positive term for it being shut off]", instead of "a machine" has a positive incentive iff.

I would further charitably rewrite it as:

"In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective."

A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about values. So it's just the particular incentive examined that's iff.

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Sure, but the theorem he proves in the setting where he proves it probably is if and only if. (I have not read the new edition, so, not really sure.)

It also seems to me like Stuart Russell endorses the if-and-only-if result as what's desirable? I've heard him say things like "you want the AI to prevent its own shutdown when it's sufficiently sure that it's for the best".

Of course that's not technically the full if-and-only-if (it needs to both be certain about utility and think preventing shutdown is for the best), but it suggests to me that he doesn't think we should add more shutoff incentives such as AUP.

Keep in mind that I have fairly little interaction with him, and this is based off of only a few off-the-cuff comments during CHAI meetings.

My point here is just that it seems pretty plausible that he meant "if and only if".
