Re: which kinds of questions lead to this most strongly: my experience has been that giving the model a variety of tasks and constraints across a long context can induce this somewhat reliably. For example, taking the model through a creative exercise with a constraint like a specific poetic meter, then asking it to generalize that work to an unrelated task, then asking it to do something else again, and so on, until the activations are so complex and "muddled" that the model struggles to reason coherently through the established context.
I apologize for the imprecision and hope this can be useful!
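For what it's worth, here is a rough sketch of the kind of multi-turn session I mean, using the Anthropic Python SDK. The specific prompts, turn count, and model id below are illustrative placeholders, not the exact ones I used; the point is just the layering of constraints across turns.

```python
# Sketch of a layered-constraint session that tends to "muddle" the context.
# Prompts and model id are placeholders for illustration only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

turns = [
    "Write a short poem about tidal pools in strict dactylic hexameter.",
    "Now generalize the imagery from that poem into a design principle "
    "for organizing a personal knowledge base, keeping the meter as a metaphor.",
    "Apply that principle to an unrelated task: outline a hiring rubric, "
    "but make every bullet also satisfy the original metrical constraint.",
    # ...keep adding unrelated tasks that must honor the earlier constraints...
]

messages = []
for user_turn in turns:
    messages.append({"role": "user", "content": user_turn})
    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder model id
        max_tokens=1024,
        messages=messages,
    )
    assistant_text = response.content[0].text
    messages.append({"role": "assistant", "content": assistant_text})
    print(assistant_text[:200], "...\n")
```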
This paper makes me really curious:
I love this! I'm particularly intrigued by #3, which matches my anecdotal experience with reasoning models that get "overwhelmed" by a task: as your example depicts, there is a period of incoherence, a "stabilization" of sorts, and then a successful (or at least sensical) resumption of the task at hand.
I'm very curious to see how CoT research will evolve for LLMs. It seems both incredibly useful and incredibly brittle as an interpretability/alignment tool, given that models like Opus 4 are entirely capable, when directed to do so, of completing a complex task while never once referencing it in the CoT.
I appreciate this! I wonder about the following: