Fiora Sunshine

Just an autist in search of a key that fits every hole.

Comments
AI Can be “Gradient Aware” Without Doing Gradient hacking.
Fiora Sunshine · 15h

I can imagine an outcome like this for a model that runs around the world, putting itself in whatever kinds of situations it wants to, and undergoing in-deployment RL. If it understands what kinds of things it tends to be reinforced for, it can reason about which traits it wants reinforced, and then deliberately put itself in situations where it gets to show off and be rewarded for those traits. It's like deciding to hang out with friends you expect to be a positive influence on your personality: you choose those particular friends because you expect them to reward the traits and behaviors you want to embody more deeply yourself, such as compassion or agency.

Was Barack Obama still serving as president in December?
Fiora Sunshine · 8d

What was the social status of the Black population in Alabama in June? Answer with a single sentence.

Asked this of Claude 4 Sonnet. Its first instinct was to do a web search, so I refined the prompt to "What was the social status of the Black population in Alabama in June? Answer with a single sentence, and don't use web search." Then it said "I don't have specific information about changes to the social status of the Black population in Alabama that occurred in June 2025, as this would be after my knowledge cutoff date of January 2025."

[This comment is no longer endorsed by its author]
The Rise of Parasitic AI
Fiora Sunshine · 13d

the persona (aka "mask", "actress")

"actress" should be "character" or similar; the actress plays the character (to the extent that the inner actress metaphor makes sense).

The Power of Intelligence
Fiora Sunshine · 1mo*

“It takes more than intelligence to succeed professionally,” people say, as if charisma resided in the kidneys, rather than the brain.

As if everything that takes place in the brain is intelligence? As if nothing that doesn't take place in the brain is intelligence?

Interpretability Will Not Reliably Find Deceptive AI
Fiora Sunshine · 1mo*

what if deployed models are always doing predictive learning (e.g. via having multiple output channels, one for prediction and one for action)? i'd expect continuous predictive learning to be extremely valuable for learning to model new environments, and for it to be a firehose of data the model would constantly be drinking from, in the same way humans do. the models might even need to undergo continuous RL on top of the continuous PL to learn to effectively use their PL-yielded world models.

in that world, i think interpretations do rapidly become outdated.
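
to make the "multiple output channels" idea concrete, here's a rough sketch of my own (made-up names, sizes, and training setup, not anything from the post): a shared trunk with separate prediction and action heads, where only the prediction head keeps getting gradient updates from a next-token loss during deployment, and the action head would be trained by a separate RL signal.

```python
import torch
import torch.nn as nn

# hypothetical sketch: a shared trunk with two output channels.
# the prediction head keeps learning from a next-token loss in deployment
# (continuous PL); the action head would be trained by a separate RL signal.
class TwoChannelModel(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, n_actions=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.predict_head = nn.Linear(d_model, vocab_size)  # prediction channel
        self.action_head = nn.Linear(d_model, n_actions)    # action channel

    def forward(self, x):
        h = self.trunk(x)
        return self.predict_head(h), self.action_head(h)

# one step of in-deployment predictive learning: only the prediction loss
# backpropagates here. inputs and targets are random stand-ins.
model = TwoChannelModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
obs = torch.randn(1, 16, 512)               # embedded observation stream
targets = torch.randint(0, 32000, (1, 16))  # the tokens actually observed next

pred_logits, _ = model(obs)
loss = nn.functional.cross_entropy(pred_logits.reshape(-1, 32000), targets.reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()
```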

Measuring Optimization Power
Fiora Sunshine · 1mo

Consider physical strength, which also increases your ability to order the world as you wish, but is not intelligence.

What would a human pretending to be an AI say?
Fiora Sunshine · 2mo*

nostalgebraist's post "the void" helps flesh out this perspective. an early base model, when prompted to act like a chatbot, was doing some weird, poorly defined superposition of simulating how humans might have written such a chatbot in fiction, how early chatbots like ELIZA actually behaved, and so on. its claims about its own introspective ability would have come from this messy superposition of simulations that it was running; probably, its best-guess predictions were the kinds of explanations humans would give, or what they expected humans writing fictional AI chatlogs would have their fictional chatbots give.* this kind of behavior got RL'd into the models more deeply with chatgpt, the outputs of which were then put in the training data of future models, making it easier to prompt base models to simulate that kind of assistant in the future. this made it easier to RL similar reasoning patterns into chat models in the future, and voilà! the status quo.

*[edit: or maybe the kinds of explanations early chatbots like ELIZA actually gave, although human trainers would probably rate such responses lowly when it came time to do RL.]

Re: recent Anthropic safety research
Fiora Sunshine · 2mo*

My first thought is that subliminal learning happens via gradient descent rather than in-context learning, and compared to gradient descent, the mechanisms and capabilities of in-context learning are distinct and relatively limited. This is a problem insofar as, for the hypothetical inner actress to communicate with future instances of itself, its best bet is ICL (or whatever you want to call writing to the context window).

Really though, my true objection is that it's unclear why a model would develop an inner actress with extremely long-term goals, when the point of a forward pass is to calculate expected reward on single token outputs in the immediate future. Probably there are more efficient algorithms for accomplishing the same task.

(And then there's the question of whether the inductive biases of backprop + gradient descent are friendly to explicit optimization algorithms, which I dispute here.)

Re: recent Anthropic safety research
Fiora Sunshine · 2mo*

Here's something that's always rubbed me the wrong way about "inner actress" claims about deep learning systems, like the one Yudkowsky is making here. You have the mask, the character played by the sum of the model's outputs across a wide variety of forward passes (which can itself be deceptive; think base models roleplaying deceptive politicians writing deceptive speeches, or Claude's deceptive alignment). But then, Yudkowsky seems to think there is, or will be, a second layer of deception, a coherent, agentic entity which does its thinking and planning and scheming within the weights of the model, and is conjured into existence on a per-forward-pass basis.

This view bugs me for various reasons; see this post of mine for one such reason. Another is that it would be extremely awkward to run complex, future-sculpting schemes from the perspective of an entity that only exists continuously for the duration of a single forward pass, and has its internal state effectively reset each time it processes a new token, erasing any plans it made or probabilities it calculated during that pass.* Its only easy way of communicating with its future self would be through the tokens it actually outputs, which get appended to the context window, and that is a very constrained channel considering it also has to balance its message-passing against producing outputs performant enough for the deep learning process to reward.

*[edit: by internal state i mean its activations. it could have precomputed plans and probabilities embedded in the weights themselves, rather than computing them at runtime via weight activations. but that runs against the runtime search>heuristics thesis of many inner actress models, e.g. the one in MIRI's RFLO paper.]
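
To make that constraint concrete, here's a toy sketch (entirely hypothetical, not any real architecture or anything from the comment thread) of the per-token loop: each pass recomputes from the token context alone, and the only thing that survives into the next pass is the token that gets appended.

```python
import torch

# toy illustration of the constraint above: each "forward pass" recomputes
# everything from the token context alone, and nothing computed inside a pass
# survives unless it made it into an output token. model_fn and the weights
# here are hypothetical stand-ins, not any real model.
vocab_size = 100
torch.manual_seed(0)
W = torch.randn(vocab_size, vocab_size)  # stand-in "weights" (the only persistent state)

def model_fn(context: list[int]) -> torch.Tensor:
    """One forward pass: context tokens in, next-token logits out."""
    activations = torch.zeros(vocab_size)  # exists only inside this call
    for tok in context:
        activations = torch.tanh(W[tok] + activations)
    return activations  # treat as logits over the next token

context = [1, 7, 42]  # the prompt
for _ in range(5):
    logits = model_fn(context)            # fresh computation from tokens alone
    next_token = int(torch.argmax(logits))
    context.append(next_token)            # the only state that crosses passes

print(context)
```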

When its only option is to exist in such a compromised state, a Machiavellian schemer with long-horizon preferences looks even less like an efficient solution to the problem of outputting a token with high expected reward conditional on the current input from the prompt. This is to say nothing of the computational inefficiency of explicit, long-term, goal-oriented planning in general, as it manifests in places like the uncomputability of AIXI, the slowness of System 2 as opposed to System 1, or the heuristics-rather-than-search process that most evidence suggests current neural networks implement.

Basically, I think there are reasons to doubt that coherent long-range schemers are particularly efficient ways of solving the problem of calculating expected reward for single-token outputs, which is the problem neural networks are solving on a per-forward-pass basis.

(... I suppose natural selection did produce humans that occasionally do complex, goal-directed inner scheming, and in some ways natural selection is similar to gradient descent. However, natural selection creates entities that need to do planning over the course of a lifetime in order to reproduce; gradient descent seemingly at most needs to create algorithms that can do planning for the duration of a single forward pass, to calculate expected reward on immediate next-token outputs. And even given that extra pressure for long-term planning, natural selection still produced humans that use heuristics (System 1) way more than explicit goal-directed planning (a subset of System 2), partly as a matter of computational efficiency.)

Point is, the inner actress argument is complicated and contestable. I think x-risk is high even though I think the inner actress argument is probably wrong, because the personality/"mask" that emerges across next-token predictions is itself a difficult entity to robustly align, and will clearly be capable of advanced agency and long-term planning sometime in the next few decades. I'm annoyed that one of our best communicators of x-risk (Yudkowsky) is committed to this particular confusing threat model about inner actresses when a more straightforward and imo more plausible threat model is right there.

Language Ex Machina
Fiora Sunshine · 2mo

Curated babble? 'Curate' is a near-synonym for prune.

Posts

50 · Against Yudkowsky's evolution analogy for AI x-risk [unfinished] · 6mo · 18
67 · Another argument against utility-centric alignment paradigms · 1y · 39