All of Gunnar_Zarncke's Comments + Replies

I disagreed with the post for the reasons given by tailcalled, but in the end, decided to upvote it. I did so because I think its line of reasoning is valid and the counterpoint is often not made precise enough, i.e., tailcalled's counterargument is weak as given, even if morally appealing.

scoring self-awareness in models as the degree to which the “self” feature is activated

I wonder whether it is possible to measure the time constant of the thought process in some way. When humans think and speak, the content of the process is mostly on the duration of seconds to minutes. Perception is faster and planning is longer. This subjective description of the "speed" of a process should be possible to make precise by e.g. looking at autocorrelation in thought contents (tokens, activations).

We might plot the strength by time.

If what you say is right, we should see increasingly stronger weights for longer time periods for LLMs.
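The autocorrelation idea above can be made concrete with a toy sketch. The series here are synthetic AR(1) signals, not real token or activation traces, and both the 1/e threshold and the function names are assumptions chosen for illustration. The "time constant" is taken to be the first lag at which the autocorrelation drops below 1/e.

```python
import math
import random


def autocorrelation(xs, max_lag):
    """Normalized autocorrelation of xs for lags 0..max_lag."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / var
        for k in range(max_lag + 1)
    ]


def time_constant(acf):
    """First lag where the autocorrelation falls below 1/e, or None."""
    threshold = 1.0 / math.e
    for lag, value in enumerate(acf):
        if value < threshold:
            return lag
    return None


def ar1_series(phi, n, seed=0):
    """Synthetic stand-in for a 'thought content' signal (an assumption)."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, 1.0))
    return xs


# A quickly decorrelating process versus a slowly decorrelating one.
fast = ar1_series(phi=0.5, n=20000)
slow = ar1_series(phi=0.95, n=20000)
tau_fast = time_constant(autocorrelation(fast, 60))
tau_slow = time_constant(autocorrelation(slow, 60))
print(tau_fast, tau_slow)
```

Plotting the full autocorrelation curve against lag, rather than reducing it to one number, would correspond to "plotting the strength by time".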

A) your observation about negative feedback being undirected is correct, i.e., a well-known phenomenon.

B) can you give some examples?

I wouldn't take board games as a reference class but rather war or maybe elections. I'm not sure in these cases you have more clarity towards the end.

A while back, I wrote up my leadership principles for my team. It follows the Leadersheep structure and how I apply it. The write-up here is lightly edited.

<X> sets a strong vision of the software and system architecture. <X> leads by example, admits mistakes, and is prepared for a long journey. <X> is improving the software system and platform overall.


Personal statement:

I prefer open and participative structures. An organization where honest, open communication is common and valued, and conflicts can be solved by working with neutr

...

Can you make this a bit less abstract by providing some examples for q, a, and G and how they are grounded in measurable things?

The post answers the first question "Will current approaches scale to AGI?" in the affirmative and then seems to run with that. 

I think the post makes a good case that Yudkowsky's pessimism is not applicable to AIs built with current architectures and scaled-up versions of current architectures. 

But it doesn't address the following cases:

  • Systems of such architectures 
  • Systems built by systems that are smarter than humans
  • Such architectures used by actors that do not care about alignment

I believe for these cases, Yudkowsky's arguments and pessi...

I seem to be using Anki quite a bit like you. 

I stopped using Anki two times, and when I noticed it the second time, I added a monthly reminder to pick up the habit again.

For concept-like cards, it helped to write short daily FB posts about them.

Can you provide some simple or not-so-simple example automata in that language?

5 Vanessa Kosoy 14d
Good idea!

EXAMPLE 1

Fix some alphabet Σ. Here's how you make an automaton that checks that the input sequence (an element of Σ*) is a subsequence of some infinite periodic sequence with period n. For every k in Z/n, let A_k be an automaton that checks whether the symbols in the input sequences at places i s.t. i ≡ k (mod n) are all equal (its number of states is O(n|Σ|)). We can modify it to make a transducer A′_k that produces its unmodified input sequence if the test passes and ⊥ if the test fails. It also produces ⊥ when the input is ⊥. We then chain A′_0, A′_1, …, A′_{n−1} and get the desired automaton.

Alternatively, we can connect the A_k in parallel and then add an automaton B with n boolean inputs that acts as an AND gate. B is a valid multi-input automaton in our language because AND is associative and commutative (so we indeed get a functor on the product category).

Notice that the description size of this automaton in our language is polynomial in n. On the other hand, a tabular description would be exponential in n (the full automaton has exponentially many states). Moreover, I think that any regular expression for this language is also exponentially large.

EXAMPLE 2

We only talked about describing deterministic (or probabilistic, or monadic) automata. What about nondeterministic? Here is how you can implement a nondeterministic automaton in the same language, without incurring the exponential penalty of determinization, assuming non-free categories are allowed.

Let B be some category that contains an object B and a morphism b: B → B s.t. b ≠ id_B and b² = b. For example it can be the closed cartesian category freely generated by this datum (which I think is well-defined). Then, we can simulate a non-deterministic automaton A on category C by a deterministic transducer from C to B:

  • The state set is always the one element set (or, it can be two elements: "accept" and "reject").
  • For every state of A, we have a variable of signature B → B. This variable is inten
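As a rough illustration of Example 1 only, here is a plain-Python sketch of the per-residue checkers and the two ways of composing them. It does not use the categorical language from the thread. The names `residue_check`, `transducer`, `accepts_chained`, and `accepts_parallel` are hypothetical, and `BOTTOM` is a sentinel standing in for ⊥.

```python
BOTTOM = object()   # sentinel standing in for ⊥ (an assumption of this sketch)
_UNSEEN = object()  # marks "no symbol of this residue class seen yet"


def residue_check(k, n, seq):
    """A_k: True iff all symbols at positions i ≡ k (mod n) are equal.

    The loop only tracks the position mod n and the first symbol seen in
    class k, which matches the O(n|Σ|) state count mentioned above.
    """
    pos = 0
    seen = _UNSEEN
    for symbol in seq:
        if pos == k:
            if seen is _UNSEEN:
                seen = symbol
            elif symbol != seen:
                return False  # test fails; stop early
        pos = (pos + 1) % n
    return True


def transducer(k, n, seq):
    """A'_k: pass the input through unchanged if A_k accepts, else emit ⊥.

    ⊥ on the input is propagated as ⊥ on the output.
    """
    if seq is BOTTOM or not residue_check(k, n, seq):
        return BOTTOM
    return seq


def accepts_chained(seq, n):
    """Chain A'_0, A'_1, ..., A'_{n-1} and accept iff the result is not ⊥."""
    out = seq
    for k in range(n):
        out = transducer(k, n, out)
    return out is not BOTTOM


def accepts_parallel(seq, n):
    """Run all A_k in parallel and combine their outputs with an AND gate."""
    return all(residue_check(k, n, seq) for k in range(n))


print(accepts_chained("abcabcab", 3), accepts_chained("abcabd", 3))
```

Both compositions use n small checkers, which is the point about description size: neither writes out the exponentially large product automaton explicitly.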

Suggested improvement:

That which can be destroyed by the 𝘸𝘩𝘰𝘭𝘦 truth should be.


I do think there are hard-wired Little Glimpses of Empathy, as Steven Byrnes calls them, that get the cooperation game started. I think these can have different strengths for different people, but they are just one of the many rewards ("stay warm, get fed" etc.) that we have and thus often not the most important factor, especially at the edge of high power where you can get more of the other rewards.

I use flashcards too and use Anki (PC and Android). For vocabulary, very short cards are best. But for insights I use bigger cards. That is because a) it is more about being reminded of the insight and its context, and b) I want to know the sources and further pointers, which makes it more like a database.

When you review only five cards a day, how does that add up to memorizing 40000 cards? It seems that on many days you have to review many more cards, such as on the train, as you mentioned. What does the distribution look like in practice? Or do you accept that you will not review most of your cards?

3 Tereza Ruzickova 14d
Hi, thank you for your comment! Yes, exactly, I find that if I set the daily bar very low (e.g. 5 a day), and keep the habit alive in that way, I will occasionally get bursts of motivation that lead to much bigger numbers. I'd say once a week I may do 50-100 cards in one go, once a month it might be even more. But I never force myself into that, I just follow a motivation wave when it comes (and it's made easier by the cards being short and easy). This means that there will be periods without bigger waves and that's ok. I found the Tiny Habits book very useful for this approach.

Occasionally these waves result in there being too many cards in the revision "queue", which makes the daily habit harder (there are too many cards that I can't remember). In those times, I reset a bunch of the cards that are in the queue and learn them again from scratch later on.

In terms of the 40 000 cards, I should have clarified that I am not necessarily actively learning many of those. Many of them are just dormant in the database, because they were for example relevant in my university studies, but I don't find them relevant now. But I still like having them there in case I need them at some point. I also have many cards that I have already learnt many times and therefore they come back around once every year or two. I find that with the slow and steady approach, I do eventually learn all the cards that are currently relevant to me, I just have to be patient. If something needs to be learnt urgently, I put it in my ASAP deck, which has far fewer cards, and I get to them more quickly. Hope this makes sense, thank you for your interest!

Have you tried leaning into an emotion?

If that doesn't make sense: You can practice noticing the effects of the emotion on your body and mind and see if you can do more of that. For example, you may notice how parts of your face move when you smile while happy and then let the movement go further. The corresponding mental part is harder to describe but works in much the same way.

It is somewhat of a tangent, but if better communication is one effect of more powerful AI, that suggests another way to measure AI capability gain: Changes in the volume of (textual) information exchanged between people, number of messages exchanged, or number of contacts maintained.

I agree that social and physical things are different (I mean, I indicated so). But please explain how guilt is different.


I think trying to strongly align an LLM is futile.

1 Bill Benzon 25d
LLM as Borg? I think of LLMs as digital wilderness. You explore it, map out some territory that interests you, and then figure out how to "domesticate" it, if you can. Ultimately, I think, you're going to have to couple with a World Model.

I think you are misunderstanding me. ChatGPT is not just the superposition of characters. For the fiction and novels it has read, yes, but for the real-life conversations, no. ChatGPT is a superposition of fiction and real dialogue, and the latter doesn't follow narratives. If you prompt it into a forum thread scenario, it will respond with real-life conversations with fewer waluigis. I tried, and it basically works (though I need more practice).

8 Cleo Nardo 1mo
Oh, I misunderstood. Yep, you're correct, ChatGPT is a superposition of both fictional dialogue and forum dialogue, and you can increase the amplitude of forum dialogue by writing the dialogue in the syntax of forum logs. However, you can also increase the amplitude of fiction by writing in the dialogue of fiction, so your observation doesn't protect against adversarial attacks against chatbots. Moreover, real-life forums contain waluigis, although they won't be so cartoonishly villainous.

Which is why when you learn a new sport it is a good idea to feel happy when your action worked well but mostly ignore failures - that would more likely lead to you not liking the sport than make you better.

I think this proves a bit too much. It seems plausible to me that this super-position exists in narratives and fiction, but real-life conversations are not like that (unless people are acting, and even then they sometimes break). For such conversations and statements, the superposition would at least be different. 

This does suggest a different line of attack: Prompt ChatGPT into reproducing forum conversations by starting with a forum thread and let it continue it.

5 Cleo Nardo 1mo
  That's exactly the point I'm making! The chatbot isn't a unique character which might behave differently on different inputs. Rather, the chatbot is the superposition of many different characters, and their amplitude can fluctuate depending on how you interact with the superposition. 

A lot of people seem to think that money/property/guilt are magic. Or, for that matter, more physical processes like electricity, GPS or refrigeration.

The first thing to do is to distinguish human things from inhuman things. Physical things really are run by rigid laws. Social things like contracts, money, property, and a guilty verdict are caused by humans and this should make it obvious that they don't have rigid behavior. (The feeling of guilt is yet a third category.)

My kids were making agreements and bets and when trying to get them enforced learned that it helps to have it written down or having a witness. One of my sons, when 12, made a contract with me to have the right and responsibility to shovel snow for compensation for one winter season. 

That's how children learn what a contract is and get an intuitive understanding of where the edges are. Though I haven't heard it done among 6-year-olds.

Yes, that is one of the avenues worth exploring. I doubt it scales to high levels of optimization power, but maybe simulations can measure the degree to which it does.

A precondition for this to work is that the entities in question benefit (get rewarded, ...) from simulating other entities. That's true for the grocery store shopper mask example and most humans in many situations but not all. It is true for many social animals. Dogs also seem to run simulations of their masters. But "power corrupts" has some truth. Even more so for powerful AIs. The preference fulfillment hypothesis breaks down at the edges.

Relatedly, the Brain-like AGI project "aintelope" tries to build a simulation where the preference fulfillment hypothesis can be tested, in order to find out at which capability level it breaks down.

Agreed - but I think that power corrupts because being put in a position of power triggers its own set of motivational drives evolved for exploiting that power. I think that if an AI wasn't built with such drives, power wouldn't need to corrupt it.

I was skimming the Cog parts too quickly. I have reread and found the parts that you probably refer to:

Well, this imagined Cog would presumably develop mechanisms of introspection and of self-report that are analogous to our own. And it wouldn’t have been trained to imitate human talk of consciousness


Another key component of trying to make AI self-reports more reliable would be actively training models to be able to report on their own mental states. 

So yes, I see this question as a crucial one.

More interesting question: Can an LLM realize it is conscious by itself, e.g., by letting it reflect on its own deliberations?   

To clarify, which question did you have in mind that this is more interesting than? I see that as one of the questions that is raised in the post. But perhaps you are contrasting "realize it is conscious by itself" with the methods discussed in "Could we build language models whose reports about sentience we can trust?"

This suggests that there should be a focus on fostering the next generation of AI researchers by

  • making the insights and results from the pioneers accessible and interesting and
  • making it easy to follow in the footsteps with a promise to reach the edge quickly. 

There is one concern about the transitive nature of trust:

Emmett Shear:

Flattening the multidimensional nature of trust is a manifestation of the halo/horns effect, and does not serve you.

There are people I trust deeply (to have my back in a conflict) who I trust not at all (to show up on time for a movie). And vice versa.

Paul Graham:

There's a special case of this principle that's particularly important to understand: if you trust x and x trusts y, that doesn't mean you can trust y. (Because although trustworthy, x might not be a good judge of character.)

http...

I confirm this and I didn't have access to much of my emotions for a long time either. But my path is different. I think I avoided most of the pitfalls of suppressing emotions but there are some lessons on this path too. 

There are more and less adaptive emotional defense mechanisms. Repression and denial are not adaptive and the examples from the OP sound more like that. But altruism and sublimation are actually adaptive coping mechanisms, and I believe I did more of that. But there are problems that you can't avoid even if you have positively regulat...

I asked ChatGPT

How can the concept of an emotive conjugation be extended into an orthogonal dimension?

And got the quite good answer:

Emotive conjugation can be extended into an orthogonal dimension by considering the emotional valence of the conjugated words, as well as their emotional intensity or magnitude. This creates a two-dimensional emotional space that captures the range of emotions that can be expressed by a given word or phrase.

For example, we can use the emotional valence of words along the vertical axis of the emotional space, where positive emo

...

I agree that 1. and 2. are not the problem. I see 3. as more of a longer-term issue for reflective models, and the current problems as lying in 4. and 5.

3. I don't know about "the shape of the loss landscape" but there will be problems with "the developers wrote correct code" because "correct" here includes that it doesn't have side-effects that the model can self-exploit (though I don't think this is the biggest problem).

4. Correct rewards means two things: 

  • a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing.
  • b)
...

I asked ChatGPT for examples and posted sensible ones as individual comments (marked with "via ChatGPT"). This was the prompt:

Here are examples of Russell Conjugations (also called Emotive Conjugation): 

* We fight disinformation (positive); you moderate content (neutral); they censor dissent (negative). 

* I am passionate (positive), you're emotional (neutral), she's hysterical (negative). 

* I govern (positive), you rule (neutral), he oppresses (negative).

They're persuasive (positive), you're convincing (neutral), he's manipulative (negative).

(via ChatGPT)

2 Daniel Kokotajlo 1mo
I think convincing is more positive than persuasive.

I'm curious (positive), you're inquisitive (neutral), she's nosy (negative).

(via ChatGPT)

She's confident (positive), you're hesitant (neutral), he's doubtful (negative).

(via ChatGPT)

They're enthusiastic (positive), you're interested (neutral), he's obsessed (negative).

(via ChatGPT)

I'm ambitious (positive), you're driven (neutral), she's ruthless (negative).

(via ChatGPT)

We're careful (positive), you're cautious (neutral), they're paranoid (negative).

(via ChatGPT)

He's knowledgeable (positive), you're informed (neutral), she's a know-it-all (negative).

(via ChatGPT)

They're assertive (positive), you're bossy (neutral), she's aggressive (negative).

(via ChatGPT)

I'm frugal (positive), you're thrifty (neutral), she's cheap (negative).

(via ChatGPT)

We're innovative (positive), you're unconventional (neutral), they're eccentric (negative).

(via ChatGPT)

I'm dedicated (positive), you're stubborn (neutral), he's pigheaded (negative).

(via ChatGPT)


Also called Emotive conjugation (Wikipedia link) which lists the following examples (those with (*) already posted).

  • I am firm, you are obstinate, he/she is a pig-headed fool. (*)
  • I am sparking; you are unusually talkative; he is drunk.
  • I know my own mind; you like things to be just so; they have to have everything their way.
  • I am a freedom fighter, you are a rebel, and he is a terrorist. (*)
  • I am righteously indignant, you are annoyed, he is making a fuss over nothing. (*)
  • I have reconsidered the matter, you have changed your mind, he has gone back on his word. (*)

3 Ian McKenzie 1mo
The Wikipedia article has a typo in one of these: it should say "I am sparkling; you are unusually talkative; he is drunk." (as in the source)

Three thoughts:

  1. If you set up the system like that, you may run into the mentioned problems. It might be possible to wrap both into a single model that is trained together.
  2. An advanced system may reason about the joint effect, e.g. by employing fixed-point theorems and Logical Induction.
  3. Steven Byrnes's [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL models humans as having three components:
    1. world model that is mainly trained by prediction error 
    2. a steering system that encodes preferences over world states
    3. a  system t
...

Has its own failure modes. What does it even mean not to know something? It is just yet another category of possible answers. 

Still a nice prompt. Also works on humans.
