scoring self-awareness in models as the degree to which the “self” feature is activated
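A minimal sketch of what such a score could look like, assuming a "self" feature direction has already been extracted (e.g., as a sparse-autoencoder feature) and that per-token residual activations are available; `self_direction` and `get_activations` are hypothetical placeholders, not any existing API.

```python
# Sketch only: assumes a "self" feature direction has already been extracted
# (e.g., an SAE decoder row) and that residual-stream activations can be read.
import numpy as np

def self_awareness_score(activations: np.ndarray, self_direction: np.ndarray) -> float:
    """Mean positive projection of per-token activations onto the hypothetical 'self' feature.

    activations: array of shape (num_tokens, d_model)
    self_direction: array of shape (d_model,), assumed unit norm
    """
    projections = activations @ self_direction          # (num_tokens,)
    return float(np.clip(projections, 0, None).mean())  # average positive activation

# Hypothetical usage: compare prompts that do / don't invite self-reference.
# acts_self = get_activations(model, "Describe what you are doing right now.")
# acts_ctrl = get_activations(model, "Describe what a thermostat is doing right now.")
# print(self_awareness_score(acts_self, self_direction),
#       self_awareness_score(acts_ctrl, self_direction))
```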
I wonder whether it is possible to measure the time constant of the thought process in some way. When humans think and speak, the content of the process mostly changes on a timescale of seconds to minutes; perception is faster, and planning works on longer timescales. It should be possible to make this subjective description of the "speed" of a process precise, e.g., by looking at the autocorrelation of thought contents (tokens, activations).
We might plot the autocorrelation strength against time lag. If what you say is right, we should see increasingly strong correlations at longer lags for LLMs.
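A minimal sketch of how such a plot could be produced, assuming a `(num_tokens, d_model)` array of hidden states for one stretch of output; the toy random-walk data only stands in for real activations, and cosine similarity is just one possible choice of correlation measure.

```python
# Sketch: autocorrelation of "thought contents" over token positions.
# Assumes `activations` is a (num_tokens, d_model) array of hidden states
# for one generated passage; how it is obtained is left open.
import numpy as np
import matplotlib.pyplot as plt

def activation_autocorrelation(activations: np.ndarray, max_lag: int) -> np.ndarray:
    """Mean cosine similarity between activations that are `lag` tokens apart."""
    a = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    return np.array([
        (a[:-lag] * a[lag:]).sum(axis=1).mean() if lag > 0 else 1.0
        for lag in range(max_lag)
    ])

# Toy data standing in for real hidden states: a slowly drifting signal.
rng = np.random.default_rng(0)
activations = np.cumsum(rng.normal(size=(500, 64)), axis=0)

corr = activation_autocorrelation(activations, max_lag=100)
plt.plot(corr)
plt.xlabel("lag (tokens)")
plt.ylabel("mean cosine similarity")
plt.title("Autocorrelation of activations vs. time lag")
plt.show()
```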
A) Your observation about negative feedback being undirected is correct; it is a well-known phenomenon.
B) can you give some examples?
I wouldn't take board games as the reference class but rather war or maybe elections. I'm not sure that in those cases you have more clarity towards the end.
A while back, I wrote up my leadership principles for my team. It follows the Leadersheep structure and describes how I apply it. The write-up here is lightly edited.
<X> sets a strong vision of the software and system architecture. <X> leads by example, admits mistakes, and is prepared for a long journey. <X> is improving the software system and platform overall.
Personal statement:
...I prefer open and participative structures. An organization where honest, open communication is common and valued, and conflicts can be solved by working with neutr
Can you make this a bit less abstract by providing some examples for q, a, and G and how they are grounded in measurable things?
The post answers the first question "Will current approaches scale to AGI?" in the affirmative and then seems to run with that.
I think the post makes a good case that Yudkowsky's pessimism is not applicable to AIs built with current architectures and scaled-up versions of current architectures.
But it doesn't address the following cases:
I believe for these cases, Yudkowsky's arguments and pessi...
I seem to be using Anki quite a bit like you.
I stopped using Anki twice, and when I noticed it the second time, I added a monthly reminder to pick the habit up again.
For concept-like cards, it helped to write short daily FB posts about them.
Can you provide some simple or not-so-simple example automata in that language?
Suggested improvement:
That which can be destroyed by the 𝘸𝘩𝘰𝘭𝘦 truth should be.
Source:
I do think there are hard-wired Little Glimpses of Empathy, as Steven Byrnes calls them, that get the cooperation game started. I think these can have different strengths for different people, but they are just one of the many rewards ("stay warm, get fed", etc.) that we have and thus often not the most important factor, especially at the edge of high power, where you can get more of the other rewards.
I use flashcards too and use Anki (PC and Android). For vocabulary, very short cards are best, but for insights I use bigger cards. That is because a) it is more about being reminded of the insight and its context, and b) I want to keep the sources and further pointers, which makes it more like a database.
When you review only five cards a day, how does that add up to memorizing 40,000 cards? It seems that on many days you have to review many more cards, like on the train, as you mentioned. How does the distribution look in practice? Or do you accept that you will not review most of your cards?
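To put rough numbers on my question (back-of-the-envelope only, using the five-per-day figure from this thread):

```python
# Back-of-the-envelope: how long until 40,000 cards have each been seen once
# at five reviews per day (ignoring repeat reviews, which make it worse).
cards_total = 40_000
reviews_per_day = 5

days_needed = cards_total / reviews_per_day   # 8,000 days
print(days_needed / 365)                      # ~21.9 years for a single pass
```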
Have you tried leaning into an emotion?
If that doesn't make sense: you can practice noticing the effects of the emotion on your body and mind and see if you can do more of that. For example, you may notice how parts of your face move when you smile while happy and then let the movement go further. The corresponding mental part is harder to describe but works in much the same way.
It is somewhat of a tangent, but if better communication is one effect of more powerful AI, that suggests another way to measure AI capability gain: Changes in the volume of (textual) information exchanged between people, number of messages exchanged, or number of contacts maintained.
I agree that social and physical things are different (I mean, I indicated so). But please explain how guilt is different.
Indeed.
I think trying to strongly align an LLM is futile.
I think you are misunderstanding me. ChatGPT is not just a superposition of characters. Sure, for the fiction and novels it has read, yes, but for the real-life conversations, no. ChatGPT is a superposition of fiction and real dialogue, and the latter doesn't follow narratives. If you prompt it into a forum-thread scenario, it will respond with real-life-style conversation with fewer waluigis. I tried it, and it basically works (though I need more practice).
Which is why, when you learn a new sport, it is a good idea to feel happy when your action works well but to mostly ignore failures; dwelling on them would more likely lead to you not liking the sport than make you better.
I think this proves a bit too much. It seems plausible to me that this superposition exists in narratives and fiction, but real-life conversations are not like that (unless people are acting, and even then they sometimes break). For such conversations and statements, the superposition would at least be different.
This does suggest a different line of attack: prompt ChatGPT into reproducing forum conversations by starting with a forum thread and letting it continue it.
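A minimal sketch of that line of attack via the API, assuming the `openai` Python client (v1+) with an API key in the environment; the model name and the seed thread are placeholders, and the original suggestion is of course just as doable in the ChatGPT interface itself.

```python
# Sketch: seed the model with a forum-thread transcript and let it continue.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY set in the environment;
# the model name and the seed thread text are placeholders.
from openai import OpenAI

client = OpenAI()

seed_thread = """\
Forum thread: "Does anyone else keep two separate Anki decks?"

User_A: I keep one deck for vocabulary and one for insights. Mixing them ruined my retention.
User_B: Interesting. How do you decide which deck a new card goes into?
User_A:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Continue the forum thread, staying in character."},
        {"role": "user", "content": seed_thread},
    ],
)
print(response.choices[0].message.content)
```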
A lot of people seem to think that money/property/guilt are magic. Or, for that matter, more physical processes like electricity, GPS or refrigeration.
My kids were making agreements and bets, and when trying to get them enforced, they learned that it helps to have them written down or to have a witness. One of my sons, at age 12, made a contract with me giving him the right and responsibility to shovel snow for compensation for one winter season.
That's how children learn what a contract is and get an intuitive understanding of where its edges are, though I haven't heard of it being done among 6-year-olds.
Yes, that is one of the avenues worth exploring. I doubt it scales to high levels of optimization power, but maybe simulations can measure the degree to which it does.
A precondition for this to work is that the entities in question benefit (get rewarded, ...) from simulating other entities. That's true for the grocery store shopper mask example and for most humans in many situations, but not in all. It is true for many social animals; dogs also seem to run simulations of their masters. But "power corrupts" has some truth, even more so for powerful AIs. The preference fulfillment hypothesis breaks down at the edges.
Relatedly, the Brain-like AGI project "aintelope" tries to build a simulation in which the preference fulfillment hypothesis can be tested, to find out at which capability level it breaks down.
I was skimming the Cog parts too quickly. I have reread them and found the parts that you are probably referring to:
Well, this imagined Cog would presumably develop mechanisms of introspection and of self-report that are analogous to our own. And it wouldn’t have been trained to imitate human talk of consciousness
and
Another key component of trying to make AI self-reports more reliable would be actively training models to be able to report on their own mental states.
So yes, I see this question as a crucial one.
A more interesting question: can an LLM realize by itself that it is conscious, e.g., when it is allowed to reflect on its own deliberations?
This suggests that there should be a focus on fostering the next generation of AI researchers by
Related: When does technological enhancement feel natural and acceptable? (old post of mine)
There is one concern about the transitive nature of trust:
Emmett Shear:
Flattening the multidimensional nature of trust is a manifestation of the halo/horns effect, and does not serve you.
There are people I trust deeply (to have my back in a conflict) who I trust not at all (to show up on time for a movie). And vice versa.
Paul Graham:
There's a special case of this principle that's particularly important to understand: if you trust x and x trusts y, that doesn't mean you can trust y. (Because although trustworthy, x might not be a good judge of character.)
http...
I can confirm this; I didn't have access to much of my emotions for a long time either. But my path is different. I think I avoided most of the pitfalls of suppressing emotions, but there are some lessons on this path too.
There are more and less adaptive emotional defense mechanisms. Repression and denial are not adaptive, and the examples from the OP sound more like that. Altruism and sublimation, on the other hand, are adaptive coping mechanisms, and I believe I did more of that. But there are problems that you can't avoid even if you have positively regulat...
I asked ChatGPT
How can the concept of an emotive conjugation be extended into an orthogonal dimension?
And got this quite good answer:
...Emotive conjugation can be extended into an orthogonal dimension by considering the emotional valence of the conjugated words, as well as their emotional intensity or magnitude. This creates a two-dimensional emotional space that captures the range of emotions that can be expressed by a given word or phrase.
For example, we can use the emotional valence of words along the vertical axis of the emotional space, where positive emo
I agree that 1. and 2. are not the problem. I see 3. as more of a longer-term issue for reflective models, and the current problems as lying in 4. and 5.
3. I don't know about "the shape of the loss landscape", but there will be problems with "the developers wrote correct code", because "correct" here includes that the code doesn't have side effects that the model can self-exploit (though I don't think this is the biggest problem).
4. Correct rewards means two things:
I asked ChatGPT for examples and posted sensible ones as individual comments (marked with "via ChatGPT"). This was the prompt:
Here are examples of Russell Conjugations (also called Emotive Conjugation):
* We fight disinformation (positive); you moderate content (neutral); they censor dissent (negative).
* I am passionate (positive), you're emotional (neutral), she's hysterical (negative).
* I govern (positive), you rule (neutral), he oppresses (negative).
They're persuasive (positive), you're convincing (neutral), he's manipulative (negative).
(via ChatGPT)
I'm curious (positive), you're inquisitive (neutral), she's nosy (negative).
(via ChatGPT)
She's confident (positive), you're hesitant (neutral), he's doubtful (negative).
(via ChatGPT)
They're enthusiastic (positive), you're interested (neutral), he's obsessed (negative).
(via ChatGPT)
I'm ambitious (positive), you're driven (neutral), she's ruthless (negative).
(via ChatGPT)
We're careful (positive), you're cautious (neutral), they're paranoid (negative).
(via ChatGPT)
He's knowledgeable (positive), you're informed (neutral), she's a know-it-all (negative).
(via ChatGPT)
They're assertive (positive), you're bossy (neutral), she's aggressive (negative).
(via ChatGPT)
I'm frugal (positive), you're thrifty (neutral), she's cheap (negative).
(via ChatGPT)
We're innovative (positive), you're unconventional (neutral), they're eccentric (negative).
(via ChatGPT)
I'm dedicated (positive), you're stubborn (neutral), he's pigheaded (negative).
(via ChatGPT)
Also called Emotive conjugation (Wikipedia link), which lists the following examples (those marked with (*) have already been posted).
Three thoughts:
Has its own failure modes. What does it even mean not to know something? It is just yet another category of possible answers.
Still a nice prompt. Also works on humans.
I disagreed with the post for the reasons given by tailcalled but, in the end, decided to upvote it. I did so because I think its line of reasoning is valid and the counterpoint is often not made precise enough; i.e., tailcalled's counterargument is weak as given, even if morally appealing.