Inducing human-like biases in moral reasoning LMs
Meta. This post is less polished than we would ideally prefer. However, we still think publishing it as-is is reasonable, to avoid further delays. We are open to answering questions and to feedback in the comments.

TL;DR. This post presents an inconclusive attempt at a proof of concept that fMRI data from human brains can help improve moral reasoning in large language models. Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.

Introduction

Our initial motivation was to create a proof of concept for an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): ‘neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation’. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as of publicly available LMs fine-tuned for moral reasoning.

Over the last few years, a series of high-profile papers have shown that LMs partially converge towards brain-like solutions and share fundamental computational principles with humans, making them a ‘biologically feasible computational framework for studying the neural basis of language’. For more (recent) evidence of apparent convergence towards human-like representations, see also Scaling laws for language encoding models in fMRI and Large Language Models Converge on Brain-Like Word Representations. To the best of our knowledge, though, potential LM-brain similarities for linguistic inputs rich in morally relevant content (e.g. moral scenarios) have not been explored previously. Nor has anyone tried to improve LM moral reasoning using moral reasoning neuroimaging datasets (though similar ideas have been explored for LMs more broadly and e.g. for Convolutional Neural Networks
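To make the intended setup more concrete, here is a minimal sketch of the kind of multi-task fine-tuning this suggests: a shared encoder trained jointly on a moral-acceptability label and on regressing the fMRI voxel activations recorded while participants read the same scenarios. The model name, head sizes, and loss weighting are illustrative assumptions, not the configuration used in the repository above.

```python
# Sketch of brain-aligned fine-tuning: shared encoder, two heads,
# joint loss = moral classification loss + weighted fMRI regression loss.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BrainAlignedMoralLM(nn.Module):
    def __init__(self, model_name="bert-base-cased", n_voxels=1024,
                 brain_loss_weight=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.moral_head = nn.Linear(hidden, 2)         # acceptable / unacceptable
        self.brain_head = nn.Linear(hidden, n_voxels)  # predicted voxel activations
        self.brain_loss_weight = brain_loss_weight

    def forward(self, input_ids, attention_mask, moral_labels=None, fmri=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]           # [CLS]-style pooling
        moral_logits = self.moral_head(pooled)
        brain_pred = self.brain_head(pooled)

        loss = None
        if moral_labels is not None:
            loss = nn.functional.cross_entropy(moral_logits, moral_labels)
        if fmri is not None:
            brain_loss = nn.functional.mse_loss(brain_pred, fmri)
            loss = brain_loss if loss is None else loss + self.brain_loss_weight * brain_loss
        return {"loss": loss, "moral_logits": moral_logits, "brain_pred": brain_pred}

# Toy usage on a single batch (random tensor standing in for real voxel data):
tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = BrainAlignedMoralLM()
batch = tok(["He lied to his friend to avoid hurting her feelings."],
            return_tensors="pt", padding=True)
out = model(batch["input_ids"], batch["attention_mask"],
            moral_labels=torch.tensor([1]),
            fmri=torch.randn(1, 1024))
out["loss"].backward()
```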
In essence, yes, stego is something that looks innocuous and requires special interpretation (the decoding function); I wrote about that here. The interpreter needs certain abilities: either an undisclosed decoding function (hidden in the weights?) that is unknown to the monitor, or some kind of private key (though public-key stego also exists). Otherwise it is not that dangerous: we can just run the same prompt with the modified output on the same model and get the secret (as in the dog-whistling case). Or the monitor might be able to do it if it shares some weights with the observed model (or is the same model). This is Kerckhoffs's principle.
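A toy sketch of that point (illustrative only, not a scheme from the post): the decoding procedure can be fully public, but the payload is only recoverable with the private key, so a monitor running the same procedure without the key gets noise.

```python
# Each payload bit is XORed with a keystream bit derived from the key;
# the result selects one of two synonyms at that position in the covertext.
import hashlib

SYNONYMS = [("big", "large"), ("quick", "fast"), ("happy", "glad"),
            ("smart", "clever"), ("begin", "start"), ("buy", "purchase"),
            ("end", "finish"), ("help", "assist")]

def keystream(key: str, n: int) -> list[int]:
    digest = hashlib.sha256(key.encode()).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(n)]

def encode(payload_bits, key):
    ks = keystream(key, len(payload_bits))
    return [SYNONYMS[i][b ^ k] for i, (b, k) in enumerate(zip(payload_bits, ks))]

def decode(words, key):
    ks = keystream(key, len(words))
    return [SYNONYMS[i].index(w) ^ k for i, (w, k) in enumerate(zip(words, ks))]

secret = [1, 0, 1, 1, 0, 0, 1, 0]
cover = encode(secret, key="private-key")
print(cover)                               # looks like an innocuous word list
print(decode(cover, key="private-key"))    # interpreter with the key: the secret
print(decode(cover, key="monitor-guess"))  # monitor without the key: noise
```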