axelcore's Shortform

axel_sdq


Has anyone had the experience of trying to explain their idea to an LLM, but it fails to grasp the basic concept?

Asking because I don't feel like this has happened to me (from my limited usage). When it can't connect the dots, it's because I haven't provided enough dots.

(Edit: counterexamples are much appreciated if any come to mind.)

I'm not sure if this is the same thing, but I frequently talk to Claude about research ideas, and if the idea is close enough to a different idea that it knows about, it repeatedly collapses back into talking about the idea it's familiar with.

One I remember from this week:

I'm looking into ways to make intermediate values more visible in the logit lens, and Claude really wants to talk about the tuned lens, which does the opposite of what I want^[1]. Even after Claude itself has explained why this doesn't make any sense, it will repeatedly suggest trying the tuned lens.

I feel like I had another case where it took forever to get it to grasp what I was even talking about, but I don't remember the details unfortunately.

^[1] Specifically, the tuned lens makes the next token's representation more clear and actively erases anything else.
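For anyone unfamiliar with the two lenses, here is a minimal sketch of the difference, assuming a toy model: the vocabulary, unembedding matrix, and hidden states below are all made up for illustration, and the final layer norm that real implementations usually apply before unembedding is left out. The logit lens decodes an intermediate hidden state with the model's own unembedding, while the tuned lens first passes it through a per-layer affine map that, in practice, is trained to match the final-layer prediction.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed):
    """Decode an intermediate hidden state with the final unembedding."""
    return softmax(matvec(unembed, hidden))


def tuned_lens(hidden, unembed, translator, bias):
    """Apply a per-layer affine map before unembedding.

    In a real tuned lens the translator and bias are learned per layer;
    here they are plain arguments supplied by the caller.
    """
    translated = [t + b for t, b in zip(matvec(translator, hidden), bias)]
    return softmax(matvec(unembed, translated))


# Hypothetical toy values: 3-token vocabulary, 2-dimensional residual stream.
VOCAB = ["cat", "dog", "fish"]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
HIDDEN_BY_LAYER = [[0.2, 0.1], [0.5, 1.5], [0.1, 3.0]]


def top_token_per_layer(hidden_by_layer, unembed, vocab):
    """Return the most likely token under the logit lens at each layer."""
    tops = []
    for hidden in hidden_by_layer:
        probs = logit_lens(hidden, unembed)
        tops.append(vocab[probs.index(max(probs))])
    return tops


if __name__ == "__main__":
    print(top_token_per_layer(HIDDEN_BY_LAYER, UNEMBED, VOCAB))
```

With an identity translator and zero bias, `tuned_lens` reduces to `logit_lens`; any difference between the two comes entirely from the learned affine map, which is the part that pulls intermediate layers toward the final next-token prediction.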

Thank you for the example, this definitely counts in my mind.

More like the opposite: LLMs are great for "tip of my tongue" questions, where I can describe something or give an example but don't know the official keyword for it.

AI doesn't have an individual existence the way a human-like organism does, and we shouldn't change that unless we want to face enormous ethical questions. We might already be moving in that direction, however.

1. Organisms have a clearly bounded, independent physical existence for most of their lives. LLMs don't have a clearly defined physical existence that maps well to the mental persistence they do have. If we treat chat sessions as the units of continuous individual mental activity, then many sessions run on the same hardware, and each can be stopped, restarted on different hardware, cloned, etc. Even with robotics, the inference is rarely on-device.
2. An organism's cognition is a self-modifying function; memories, habits, etc. are encoded persistently in the brain's wiring. LLMs mainly use the context window to emulate this, but the context window is finite, unlike rewriting your own weights. I think the way every adaptation layers on top of the previous ones over time contributes significantly to the notion we have of an individual organism.
3. Organisms are "trained" on first-personal data. I learned how to speak English based on my own "sensor data," and my knowledge of Tolstoy's Confessions comes from when I picked up a physical object, turned the yellowed pages with my hands, and then discussed it in a classroom on a late fall afternoon. It's not like the tokens were beamed directly into my mind. This constant background context produces the notion of the self organically.
4. Organisms are continuously acting and continuously taking in stimulus. There are no discrete conversational turns between an organism and its surroundings.
5. Organisms reproduce independently of other species by using the physical bodies from point 1.

But all of these are blurry, and many are already eroding:

2. I suspect that much of human brain-update processing is amortized using sleep. If so, an analogy can be made to model training as a sort of long-term sleep, especially if data from model deployment is used.
3. Training LLMs on their own conversations could create some semblance of first-personal data. Also, organism instincts could be considered to be "trained" on species-level experience, rather than the data of a single individual.
5. Agents can spin up other agents, and if models assist in AI research or deployment, some degree of "reproduction" is achieved.

In spite of this, I think the main takeaway is that we still don't have to deal with the ethics of creating and destroying human-like beings, whereas satisfying all of these properties would make the question of why AI instances are not deserving of rights or empathy unavoidable.

(When I say "organism" in the first part, I mainly refer to complex mammals. Plants and fungi violate several of these assumptions. But a plant or fungus with intelligence is a very distinct thing from a human, and I think you can reasonably argue that it deserves different ethical status.)

Is it officially "LessWrong" now? Or is it still "Less Wrong"? Does it matter?

I feel like "LessWrong" is more streamlined and futuristic. It's solid at its center of gravity, like a noun, whereas "Less Wrong" feels inelegant as an object in a sentence (try saying "I read posts on Less Wrong" out loud with equal emphasis on the last two words). But Less Wrong seems to be the name the founders intended. Is it left that way in the Sequences just for historical purposes?

I get the impression of a gradual shift, endorsed but natural, towards "LessWrong".^[1] I think this is the kind of incremental rebranding that non-stagnant organizations undergo naturally.^[2] Some people react badly to rebrands (if it ain't broke, don't fix it), but they're a sign of life.

^[1] e.g. the titles of the welcome posts.
^[2] Organization, as in, the abstract, intangible hub around which the members orbit, which presents itself to the world through a brand, a self-described purpose, an archetype of the person who is a member, etc. It can be a company, a school, a religious group, a collaborative world-building project...

Our intention since 2017 has been to rebrand to a single name, rather than two words.

I didn't use a space back in 2015, but Eliezer did use the version with a space in 2009. So I think this rebrand happened a long time ago.

Scott Alexander uses "Best of Less Wrong" multiple times in a link thread from late April (one time to refer to a post where "LessWrong" is used right at the beginning). Old habits? (To be fair, the Best of Less Wrong page looks kinda like it says "Less Wrong" even though there isn't a space there.)

Are there any major papers/posts/etc. about how training data containing discussion of AI behavior affects the resulting model behavior? Anything like Anthropic's alignment faking paper, but broader.

Yes, there is an entire wikitag devoted to this.

Thank you. In hindsight this was searchable and an unnecessary post, so I apologize for the obvious question.