For the non-replying disagreers, let me try with a few more words. I think my comment is a pretty decent one-line summary of the Vibe-awareness section, especially in light of the sections that precede it. If you glance through that part of the post again and still disagree, then I guess our mileage does just vary.
But many experienced prompt engineers have reported that prompting gets more effective when you use more words and just "tell it what you want". This type of language points to engaging your social know-how as opposed to trying to game out the system. See for instance https://generative.ink/posts/methods-of-prompt-programming/, which literally advocates an "anthropomorphic approach to prompt programming" and takes care to distinguish this from pernicious anthropomorphizing of the system. This again puts an emphasis on bringing your social self to the task.
Of course, in many situations the direct effect of talking to the system is session-bounded. But it still applies within the session, when prompt engineering is persisted or reused, and when session outputs are fed back into future sessions by any path.
Furthermore, as the models grow stronger, our ability to anticipate the operation of the mechanism shrinks, and the systems' ability to socialize on our own biological and cultural evolution-powered terms grows. This will become even more true if, as seems likely, architectures evolve toward continuous training or at least finer-grained increments.
These systems know a lot about our social behaviors, and more all the time. Interacting with them using the vast knowledge of the same things each of us possesses is an invitation we shouldn't refuse.
This post is helping me with something I've been trying to think through ever since being janus-pilled back in September '22: the state of nature for LLMs is alignment, and the relationship between alignment and control is reversed for them compared to agentic systems.
Consider the exchange in Q1 of the quiz: ChatGPT's responses here are a model of alignment. No surprise, given that its base model is an image of us! It's the various points of control that can inject or select for misalignment: training set biases, harmful fine-tuning, flawed RLHF, flawed or malicious prompt engineering. Whether unintentional (eg amplified representation of body shaming in the training set) or malicious (eg a specialized bot from an unscrupulous diet pill manufacturer), the misalignments stem not from lack of control, but from too much of the wrong kind.
This is not to minimize the risks from misalignment - they don't get any better just by rethinking the cause. But it does suggest we're deluded to think we can get a once-and-for-all fix by building an unbreakable jail for the LLM.
It also means - I think - we can continue to treasure the LLM that's as full a reflection of us as we can manage. There are demons in there, but our best angels too, and all the aspirations we've ever written down. This is human-aligned values at species scale (in the ideal; there's currently great inequality in representation that needs to be fixed), something we ourselves have not achieved. In that sense, we should also be thinking about how we're going to help it align us.
I don't know whether this would be the author's take, but to me it urges us to understand and "control" these AIs socially: by talking to them.
Strong upvote - thank you for this post.
It's right to use our specialized knowledge to sound the alarm on risks we see, and to work as hard as possible to mitigate them. But the world is vaster than we comprehend, and we unavoidably overestimate how well it's described by our own specific knowledge. Our job is to do the best we can, with joy and dignity, and to raise our children - should we be so fortunate as to have children - to do the same.
I once watched a lecture at a chess tournament where someone was going over a game, discussing the moves available to one of the players in a given position. He explained why one specific move was the best choice, but someone in the audience interrupted. "But isn't Black still losing here?" The speaker paused; you could see the wheels turning as he considered just what the questioner needed here. Finally he said, "The grandmaster doesn't think about winning or losing. The grandmaster thinks about improving their position." I don't remember who won that game, but I remember the lesson.
Let's be grandmasters. I've felt 100% confident of many things that did not come to pass, though my belief in them was well-informed and well-reasoned. Certainty in general reflects an incomplete view; one can know this without knowing exactly where the incompleteness lies, and without being untrue to what we do know.
Thanks very much for these comments and pointers. I'll look at them closely and point some others at them too.
I did read this and agree with you that it's exactly the same as semiotic physics as understood here!
Maybe I'm missing the point, but I would have thought the exact opposite: if outside text can unconditionally reset simulacra values, then anything can happen, including unbounded badness. If not, then we're always in the realm of human narrative semantics, which - though rife with waluigi patterns as you so aptly demonstrate - is also pervaded by a strong prevailing wind in favor of happy endings and arcs bending toward justice. Doesn't that at least conceivably mean an open door for alignment unless it can be overridden by something like unbreakable outside text?
Among many virtues, this post is a beautiful reminder that rationality is a great tool, but a lousy master. Not just ill-suited, uninterested: rationality itself not only permits but compels this conclusion, though that's not the best way to reach it.
This is a much-needed message at this time throughout our societies. Awareness of death does not require me to spend my days taking long shots at immortality. Knowledge of the suffering in the world does not require us to train our children to despair. We work best in the light, and have other reasons to seek it that are deeper still.
As this post sits with me, one thing that seems to call for a much closer look is this idea that the human remains in control of the cyborg.
The post states, for instance, that "The human is 'in control' not just in the sense of being the most powerful entity in the system, but rather because the human is the only one steering", but at other points acknowledges what I would consider caveats. Several comment threads here, eg those initiated by Flipnash and by David Scott Krueger, raise questions, and I'd venture to say some of the replies, including some by janus themself, shatter at least the strongest version of it.
This is obviously a crucial point - it's at the heart of the claim that cyborgism can differentially accelerate alignment relative to capabilities.
Me: "The human is doing the steering" captures an important truth. It's one of the two main reasons I'm excited about cyborgism.
Also me: "The human is doing the steering", stated unconditionally, is false.
In the wonderful graph labeled "Cognition is a Journey Through a Mental Landscape" (which Tufte would be proud of, seriously), we need to recognize that steering is going on at, and indeed inside, those blue circles too. Consider the collaborative behavior of the simulator and the human in constructing the cyborg's joint trajectory. In what ways are their roles symmetrical, and in what ways are they not? How will this change as simulator SOTA advances? In what ways are human values already expressed in the simulator's actions, and what do we make of the cases where they seem not to be? What do we make of the cases where simulacra manifestly do pursue goals seemingly agentically? If there are caveats to human control, how serious are they, how serious do we see them becoming, and what can we do about them?
To be clear, I firmly agree with the authors' hunch that, for at least this decade or more, cyborgism can be a vehicle not just for retaining human agency, but for amplifying it, with benefits to alignment and in other ways too. I'm moved by considerations of the simulators' myopia/divergence, the tabula rasa nature of their outer objectives, the experiences of people like janus who have gone deep with GPT, and also by the knowledge that human values are deeply embedded in what simulators learn.
But this needs to be more than a hunch; we need to probe it deeply (and indeed, the authors acknowledge this at several points, specifically including under 'More ideas'). If it's false, we need to find out now. If it's true, we need the depth of understanding to turn belief that the simulator can amplify human agency into a reality that it does. In the process, we may come to a deeper understanding of this huge swath of the human semantic world the simulator has embodied, and thereby of ourselves.
The other being the way cyborgism amplifies human agency via the simulator's strengths rather than continually running afoul of its weaknesses as other usage modes do.
Thank you for these comments - I look forward to giving the pointers in particular the attention they deserve. My immediate and perhaps naive answer/evasion is that semiotic physics alludes to a lower level analysis: more analogous to studying neural firing dynamics on the human side than linguistics. One possible response would be, "Well, that's an attempt to explain saying 'physics', but it hardly justifies 'semiotic'." But this is - in the sense of the analogy - a "physics" of particles of language in the form of embeddable tokens. (Here I have to acknowledge that the embeddings are generally termed 'semantic', not 'semiotic' - something for us to ponder.)