Sahil — LessWrong

fufstsahil@gmail.com

This section here has some you might find useful, among wriing that is published. Excerpted below:

Live theory in oversimplified claims:
Claims about AI.
Claim: Scaling in this era depends on replication of fixed structure. (eg. the code of this website or the browser that hosts it.)
Claim: For a short intervening period, AI will remain only mildly creative, but its cost, latency, and error-rate will go down, causing wide adoption.
Claim: Mildly creative but very cheap and fast AI can be used to turn informal instruction (eg. prompts, comments) into formal instruction (eg. code)
Implication: With cheap and fast AI, scaling will not require universal fixed structure, and can instead be created just-in-time, tailored to local needs, based on prompts that informally capture the spirit.
(This implication I dub "attentive infrastructure" or "teleattention tech", that allows you to scale that which is usually not considered scalable: attentivity.)

Claims about theorization.
Claim: General theories are about portability/scaling of insights.
Claim: Current generalization machinery for theoretical insight is primarily via parametrized formalisms (eg. equations).
Claim: Parametrized formalisms are an example of replication of fixed structure.
Implication: AI will enable new forms of scaling of insights, not dependent on finding common patterns. Instead, we will have attentive infrastructure that makes use of "postformal" theory prompts.
Implication: This will enable a new kind of generalization, that can create just-in-time local structure tailored to local needs, rather than via deep abstraction. It's also a new kind of specialization via "postformal substitution" that can handle more subtlety than formal operations, thereby broadening the kind of generalization.

These leave out the relevance to risk. That's the job of this paper: Substrate-Sensitive AI-risk Management. Let me know if these in combination lay it out more clearly!

unaligned ASI is extremely sensitive to context, just in the service of its own goals.

Risks of abuse, isolation and dependence can skyrocket indeed, from, as you say, increased "context-sensitivity" in service of an AI/someone else's own goals. A personalized torture chamber is not better, but in fact quite likely a lot worse, than a context-free torture chamber. But to your question:

Is misalignment really is a lack of sensitivity as opposed to a difference in goals or values?

The way I'm using "sensitivity": sensitivity to X = the meaningfulness of X spurs responsive caring action.

It is unusual for engineers to include "responsiveness" in "sensitivity", but it is definitely included in the ordinary use of the term when, say, describing a person as sensitive. When I google "define sensitivity" the first similar word offered is, in fact, "responsiveness"!

So if someone is moved or stirred only by their own goals, I'd say they're demonstrating insensitivity to yours.

Semantics aside, and to your point: such caring responsiveness is not established by simply giving existing infrastructural machinery more local information. There are many details here, but you bring up an important specific one:

figuring out how to ensure an AI internalises specific values

which you wonder if is not the point of Live Theory. In fact, it very much is! To quote:

A very brief word now on problems of referentiality and their connections to sensitivity. One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean. This worry includes both prosaic risks like “hallucination” (maybe it thinks “yesterday” was the date Dec 31st 2021, if its training stops in 2022) and fundamental ones like deep deceptiveness (maybe it thinks “be more loving” is to simply add more heart emojis or laser-etched smileys on your atoms). Either way, the worry is that the AI’s language and action around the words^[22] might not be subtly sensitive to what you or I might associate with it.

Of course, this is only mentioning the risk, now how to address it. In fact, very little of this post is talking concrete details about the response to threat model. It's the minus-first post, after all. But the next couple of posts start to build up to how it aims to address these worries. In short: there is a continuity between these various notions expressed by "sensitivity" that has not been formally captured. There is perhaps no one single formal definition of "sensitivity" that unifies them, but there might be a usable "live definition" articulable in the (live) epistemic infrastructure of the near future. This infrastructure is what we can supply to our future selves, and it should help our future selves understand and respond to the further future of AI and its alignment.

This means being open to some amount of ontological shifts in our basic conceptualizations of the problem, which limits the amount you can do by building on current ontologies.

Lots of interesting ideas here
I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.

I'm glad! And thank you for your excellent questions!

Thank you, Dusan!

Next time there will be more notice, and also a more refined workshop!

Great! I'd love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems because we're often taking the "thought" metaphor seriously/literally in the label "chain-of-thought", rather than backwardy/rationalization "reasoning".

But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought". That is, the optimization to have chain-of-thought in correspondence with its output can work in the backwards direction, cohering with bad output instead of leading to better output, a la partial agency.

(Admittedly human thoughts do a lot of rationalization as well. So maybe the mistake is in taking directionality implied by "chain" too seriously?)

Maybe this is obvious, but it could become increasingly reckless to not notice when you're drawing the face of "thoughts" or "chains" on CoT shoggoth-movements . You can be misled into thinking that the shoggoth is less able to deceive than it actually is.

Less obvious but important: in the reverse direction, drawing "hacker faces" on the shoggoth, as in the case of the Docker hack (section 4.2.1), can mislead into thinking that the shoggoth "wants" to or tends to hack/undermine/power-seek more than it actually, independently does. It seems at least somewhat relevant that the docker vulnerability was exploited for a challenge that was explicitly about exploiting vulnerability. Even though it was an impressive meta-hack, one must wonder how much this is cued by the prompt and therefore is zero evidence for an autonomy telos---which is crucial for the deceptive optimizer story---even though mechanistically possible.

(The word "independently" above is important: if it takes human "misuse"/participation to trigger its undermining personas, we also might have more of a continuous shot at pausing/shutdown or even corrigibilty.)

I was going to post this as a comment, but there's also an answer here: I'd say calling o1 "deceptive" could be as misleading as calling it aligned if it outputs loving words.

It has unsteady referentiality, at least from the POV of the meanings of us life-forms. Even though it has some closeness to our meanings and referentiality, the quantity of the unsteadiness of that referentiality can be qualitative. Distinguishing "deceptively aligned mesa-optimizer" from "the tentacles of the shoggoth I find it useful to call 'words' don't work like 'words' in some annoying ways" is important, in order to protect some of that (quantitatively-)qualitative difference. Both for not dismissing risks and for not hallucinating them.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments

Sequences

Posts

Wikitag Contributions

Comments

Live theory in oversimplified claims:

Claims about AI.

Claims about theorization.