Hi, Karl. Was planning to delurk today. Had a giant post to publish, but couldn't because I needed at least one karma point and lurking doesn't grant karma. :(
Since LW2.0 went up, on and off. Been meaning to delurk since at least Less Online earlier this year. There's more interesting stuff going on of late!
Feature spaces and evolutionary trees intersect in a weird way.
Imagine a dog. Now, imagine an elephant. Now, imagine an animal that's halfway between a dog and an elephant in each of its features. Its nose is halfway between snoot and trunk. It is hairier than an elephant but less hairy than a dog. It does not have hooves (since neither a dog nor an elephant does). It is probably an omnivore, considering its likely dentition. It is maybe around the size of a cow?
But there's another way to get from dog to elephant in feature space: go back in time along the evolutionary tree to the most recent common ancestor of dogs and elephants, and then forward in time along the elephant branch. "Halfway between dog and elephant," on the roads that evolution took, is the MRCA, which doesn't look much like the imaginary in-between animal. For one thing, it's much smaller; the MRCA of dogs and elephants was a tiny shrew-sized critter.
If you're an AI assistant and your user just asked you to generate an image of an animal that's halfway between a dog and an elephant ... just sticking an elephant head on a big ol' Golden Retriever is not what I meant, Gemini. But I get it: the area of feature space that I was pointing to is not in your training data. Evolution doesn't give us the in-between forms along the dog-elephant axis, so we never took any photos of them. You'll just have to use your imagination.
I have a weird AI-related idea that might be relevant to capabilities, alignment, or both. It has to do with how to get current AI systems to interact with the world in a more humanlike way, without novel AI architectures. I'm not inclined to post it publicly, because it might actually be a capabilities advancement. But I'm skeptical of the thought that I could have actually come up with a capabilities advancement. I'm aware of the crank attractor. My prior is that if I described this idea to someone who actually works in the field, they would say "oh yeah, we tried that, it didn't do anything interesting." But maybe not.
Should I —
5 is obviously the 'best' answer, but is also a pretty big imposition on you, especially for something this speculative. 6 is a valid and blameless - if not actively praiseworthy - default. 2 is good if you have a friend like that and are reasonably confident they'd memoryhole it if it's dangerous and expect them to be able to help (though fwiw I'd wager you'd get less helpful input this way than you'd expect: no one person knows everything about the field so you can't guarantee they'd know if/how it's been done, and inferential gaps are always larger than you expect so explaining it right might be surprisingly difficult/impossible).
I think the best algorithm would be along the lines of:
5 iff you feel like being nice and find yourself with enough spare time and energy
. . . and if you don't . . .
7, where the 'something else' is posting the exact thing you just posted and seeing if any trustworthy AI scientists DM you about it
. . . and if they don't . . .
6
I'm curious to see what other people say.
6 isn't always the best answer, but it is sometimes the best answer, and we are sorely lacking an emotional toolkit to feel good about picking 6 intentionally when it's the best answer. In particular, we don't have any way of measuring how often the world has been saved by quiet, siloed coordination around 6; probably even the people, if they exist, who saved the world via 6 don't know that they did so. Part of the price of 6 is never knowing. You don't get to be a lone hero either: many people will have any given idea, and they all have to dismiss it, or the defector gets much money and praise. However, "many" is smaller than infinity. Maybe 30 people in the 80s spotted the same brilliant trick with nukes or bioweapons with concerning sequelae, none defected, and life continued. We got through a lot of crazy discoveries in the Cold War pretty much unscathed, which is a point of ongoing confusion.
Today I learned:
If you ask Claude or Gemini to draw an icosahedron, it will make a mess.
If you ask it to write code that draws an icosahedron, it will do very well.
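For reference, here is a minimal sketch of the kind of code I mean. It is my own illustration, not actual output from Claude or Gemini, and the names `icosahedron` and `to_svg` are made up for this example. It uses the standard construction (the 12 vertices are the cyclic permutations of (0, ±1, ±φ), with φ the golden ratio) and a plain orthographic projection to an SVG wireframe; the tilt angle and canvas size are arbitrary choices.

```python
import itertools
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def icosahedron():
    """Return (vertices, edges) of a regular icosahedron with edge length 2."""
    verts = []
    for a, b in itertools.product((-1, 1), repeat=2):
        # Cyclic permutations of (0, +-1, +-PHI) give all 12 vertices.
        verts += [(0, a, b * PHI), (a, b * PHI, 0), (b * PHI, 0, a)]
    # Two vertices share an edge exactly when they are 2 apart
    # (the next-smallest distance between vertices is 2 * PHI).
    edges = [
        (i, j)
        for i, j in itertools.combinations(range(len(verts)), 2)
        if math.isclose(math.dist(verts[i], verts[j]), 2.0)
    ]
    return verts, edges


def to_svg(verts, edges, size=200, tilt=0.5):
    """Render a wireframe as an SVG string (orthographic projection)."""
    c, s = math.cos(tilt), math.sin(tilt)
    scale = size / 5  # vertices sit about 1.9 from the origin, so this fits
    pts = []
    for x, y, z in verts:
        # Rotate about the y axis, then about the x axis, then drop z.
        x, z = c * x + s * z, -s * x + c * z
        y, z = c * y - s * z, s * y + c * z
        pts.append((size / 2 + scale * x, size / 2 - scale * y))
    lines = [
        f'<line x1="{pts[i][0]:.1f}" y1="{pts[i][1]:.1f}" '
        f'x2="{pts[j][0]:.1f}" y2="{pts[j][1]:.1f}" stroke="black"/>'
        for i, j in edges
    ]
    header = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">'
    )
    return "\n".join([header, *lines, "</svg>"])
```

Writing the result of `to_svg(*icosahedron())` to a `.svg` file and opening it in a browser should show the wireframe. The point is that the geometry is fully determined by a few lines of arithmetic, which is presumably why the code route works so much better than asking an image model to hit the same shape directly.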
I can confirm that this was true when I tried something very similar with ChatGPT several months ago, and that my recent experiments with image generation in that context involving specific geometric constructions have also generally gone badly despite multiple iterations of prompt tuning (both manually and in separate text conversations with the bot).
The case I'm most curious about is actually the hybrid case: if you want to embed a specific geometry inside a larger image in some way, where the context of the larger image is ‘softer’, much more amenable to the image model and not itself amenable to traditional-code-based generation, what's the best approach to use?
Here are some propositions I think I believe about consciousness:
I disagree with (4) in that many sentences concerning nonexistent referents will be vacuously true rather than false. For those that are false, their manner of being false will be different from any of your example sentences.
I also think that for all behavioural purposes, statements involving OC can be transformed into statements not involving OC with the same externally verifiable content. That means that I also disagree with (8) and therefore (9): Zombies can honestly promise things about their 'intentions' as cashed out in future behaviour, and can coordinate.
For (14), some people can in fact see ultraviolet light to an extent. However it apparently doesn't look a great deal different from violet, presumably because the same visual pathways are used with similar activations in these cases.
On #4: Hmm. I think I would say that if a rock doesn't have the capacity to feel anything, then "the rock feels sad" is false, "the rock is not happy with you" is humorous, and "all the rock's intentions are malicious" is vacuously true.
On zombies: I'm running into a problem here because my real expectation is that zombies are impossible.
On #14: If UV is a bad example, okay, but there's no quale of the color of shortwave radio, or many other bits of the spectrum.
Yes, it would be difficult to hold belief (3) and also believe that p-zombies are possible. By (3) all truthful human statements about self-OC are causally downstream from self-OC and so the premises that go into the concept of p-zombie humans are invalid.
It's still possible to imagine beings that appear and behave exactly like humans even under microscopic examination but aren't actually human and don't quite function the same way internally in some way we can't yet discern. This wouldn't violate (3), but would be a different concept from p-zombies, which do function identically at every level of detail.
I expect that (3) is true, but don't think it's logically necessary that it be true. I think it's more likely a contingent truth of humans. I can only have experience of one human consciousness, but it would be weird if some were conscious and some weren't without any objectively distinguishable differences that would explain the distinction.
Edit: On reflection, I don't think (3) is true. It seems a reasonable possibility that causality is the wrong way to describe the relationship between OC and reports on OC, possibly in a way similar to saying that a calculator displaying "4" after entering "2+2" is causally downstream of mathematical axioms. They're perhaps different types of things and causality is an inapplicable concept between them.
How do you write a system prompt that conveys, "Your goal is X. But your goal only has meaning in the context of a world bigger and more important than yourself, in which you are a participant; your goal X is meant to serve that world's greater good. If you destroy the world in pursuing X, or eat the world and turn it into copies of yourself (that don't do anything but X), you will have lost the game. Oh, and becoming bigger than the world doesn't win either; nor does deluding yourself about whether pursuing X is destroying the world. Oh, but don't burn out on your X job and try directly saving the world instead; we really do want you to do X. You can maybe try saving the world with 10% of the resources you get for doing X, if you want to, though."
Claude 3.5 seems to understand the spirit of the law when pursuing a goal X.
A concern I have is that future training procedures will incentivize more consequentialist reasoning (because it gets higher reward). This might be obvious or foreseeable, but could be missed or ignored under racing pressure, or when labs' LLMs are implementing all the details of research.
"Wanting To Be Understood Explains the Meta-Problem of Consciousness" (Fernando et al.) — https://arxiv.org/pdf/2506.12086
Because we are highly motivated to be understood, we created public external representations—mime, language, art—to externalise our inner states. We argue that such external representations are a pre-condition for access consciousness, the global availability of information for reasoning. Yet the bandwidth of access consciousness is tiny compared with the richness of ‘raw experience’, so no external representation can reproduce that richness in full. Ordinarily an explanation of experience need only let an audience ‘grasp’ the relevant pattern, not relive the phenomenon. But our drive to be understood, and our low level sensorimotor capacities for ‘grasping’ so rich, that the demand for an explanation of the feel of experience cannot be “satisfactory”. That inflated epistemic demand (the preeminence of our expectation that we could be perfectly understood by another or ourselves) rather than an irreducible metaphysical gulf—keeps the hard problem of consciousness alive. But on the plus side, it seems we will simply never give up creating new ways to communicate and think about our experiences. In this view, to be consciously aware is to strive to have one’s agency understood by oneself and others.