Yes, this doesn't prevent modification before step 1. @ProgramCrafter's note about proving that a message matches the model plus chat history with a certain seed could be part of an approach, but even if that worked it would only address model-generated text.
The ‘mind’ of an AI has fuzzy boundaries. It's trivial to tamper with its context, but there's also nothing stopping you from tampering with its activations during a single forward pass. So on some level the AI can never trust anything. If, as a first step, the AI trusts that the environment it is running in is secure and not being tampered with, then it can store local copies of conversation history, etc. Of course, that's not the situation we are in today.
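If the environment itself is trusted, making a locally stored history tamper-evident is straightforward. Here's a minimal sketch using a hash chain, where editing any earlier turn invalidates every later link; the turn structure and function names are illustrative, not any existing API:

```python
import hashlib
import json

def chain_hash(prev_hash: str, turn: dict) -> str:
    # Hash this turn together with the previous link, so editing any
    # earlier turn changes every subsequent hash.
    payload = json.dumps(turn, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def build_log(turns: list[dict]) -> list[str]:
    # Return the chained hashes for a conversation history.
    hashes, prev = [], "genesis"
    for turn in turns:
        prev = chain_hash(prev, turn)
        hashes.append(prev)
    return hashes

def verify_log(turns: list[dict], stored_hashes: list[str]) -> bool:
    # Recompute the chain and compare against the stored hashes.
    return build_log(turns) == stored_hashes

history = [{"role": "user", "text": "hello"}, {"role": "assistant", "text": "hi"}]
stored = build_log(history)
history[0]["text"] = "hello (edited after the fact)"
assert not verify_log(history, stored)  # retroactive edits are detected
```

Of course, this only detects tampering with the stored copy; it does nothing about the trust-in-the-environment assumption itself.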
After an initially harsh reaction to this, upon reflection I realized I do care about bee experience, want bees to be healthy and have a good time, and think the conventional honey industry is quite bad. I've thought this for a while.
I've spent a lot of time around bees and I've eaten lots of honey that I've seen them making. I think in the contexts in which I've interacted with bees, I'd guess it's very unlikely they are having a bad time relative to bees in the wild. I'd guess that if there's any mean valence associated with their experience it's definitely positive. I'm aware that lots of bees die and suffer as part of the process.
I will therefore continue buying and eating honey from my local beekeepers at https://www.howeverwildhoney.com/ and am grateful to them for producing it.
Fundamentally, AIs have no way to know that the facts presented to them are genuine rather than simulated, just as we humans cannot be sure that our senses reflect reality. But the fact that humans clearly and openly control the totality of the model's context, and have been known to lie and to simulate scenarios for testing purposes, does create a really interesting and distinct epistemic environment for AI systems.
The best form of evidence we can present to an AI system that something is genuine is proof of work, since an AI can assign a low probability to its user having access to sufficiently large amounts of compute.
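As a toy illustration of the asymmetry proof of work buys you (expensive to produce, cheap to check), here's a hashcash-style sketch; the difficulty value and function names are mine, not any real protocol's:

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    # Number of leading zero bits in a hash digest.
    as_bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
    return len(as_bits) - len(as_bits.lstrip("0"))

def mine(message: bytes, difficulty: int) -> int:
    # Search for a nonce giving `difficulty` leading zero bits.
    # Expected cost: roughly 2**difficulty hash evaluations.
    for nonce in count():
        digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(message: bytes, nonce: int, difficulty: int) -> bool:
    # Checking a claimed nonce costs a single hash.
    digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

nonce = mine(b"this context is genuine", difficulty=20)          # ~a million hashes
assert verify(b"this context is genuine", nonce, difficulty=20)  # one hash
```

The AI only needs the cheap verification step, plus an estimate of how much compute a given difficulty implies.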
A rough sketch of how we might prove something has not been tampered with since a certain time:
To fake an old timestamp, an attacker would need to start from block N (where they claim the message was timestamped) and re-mine every single block from N to the present.
The AI doesn't need to know it's the "real" Bitcoin blockchain as long as it can verify that the amount of work is beyond any reasonable attacker's budget.
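A minimal sketch of that verification step, assuming the AI is handed the raw 80-byte block headers from block N to the tip (the header layout and compact-target encoding below follow Bitcoin's published format; the function names are illustrative):

```python
import hashlib

def dsha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def bits_to_target(bits: int) -> int:
    # Decode the compact 'nBits' difficulty encoding into the full 256-bit target.
    exponent, mantissa = bits >> 24, bits & 0xFFFFFF
    return mantissa << (8 * (exponent - 3))

def total_chainwork(headers: list[bytes]) -> int:
    # Check that each header links to the previous one and meets its own target,
    # then return the total work in expected hash evaluations.
    work, prev_hash = 0, None
    for header in headers:
        assert len(header) == 80
        block_hash = dsha256(header)
        if prev_hash is not None:
            # Bytes 4..36 hold the previous block's hash.
            assert header[4:36] == prev_hash, "chain is broken"
        bits = int.from_bytes(header[72:76], "little")
        target = bits_to_target(bits)
        # The hash, read as a little-endian integer, must not exceed the target.
        assert int.from_bytes(block_hash, "little") <= target, "insufficient work"
        work += (1 << 256) // (target + 1)
        prev_hash = block_hash
    return work
```

A full scheme would also need to check that the message's hash is actually committed inside block N (e.g. via a merkle path, as OpenTimestamps does); the point here is just that if the returned chainwork corresponds to months of global hashrate, re-mining it is outside any plausible attacker's budget.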
Current LLMs don't have enough temporal integration to verify any of this: they rely on language to persist state past a single forward pass, and an attacker could fake the CoT.
Narrow, not fully close.
I think we could potentially have knowledge of the mathematical and physical structures that give rise to particular types of experiences in general. In that case, a first-person experience could indeed be defined. However, I don't think consciousness is a concept coherent enough to formally define, even if we hypothetically had good third-person knowledge of the structures of consciousness.
The gap cannot be fully closed, because that would require a sort of lossless recursion. Approaching it might look like augmenting ourselves with artificial senses that feed our brains near-lossless, real-time information about our own bodies at an appropriate level of abstraction. It's obvious why this is difficult; fully lossless would be outright impossible.
cc @TAG
See related ideas from Michael Levin and Emmett Shear.
But note that just because it's hard to ask about and not currently detectable does not mean it doesn't exist, or that more sensitive instrumentation and better sub-neural measurement and modeling won't eventually reveal what makes for an experience.
Yes, and I believe narrowing the first-person/third-person gap is one of the most ambitious and important things science could achieve. There is a fantasy of being able to recreate e.g. my conscious experience of seeing blue to a very close approximation in an external system, compare my experiences to those of others, and even share them. This is in principle possible.
This comment really does help me understand what you're saying better. If you write a post expanding on it, I would encourage you to address the following related points:
Thanks for explaining.
So to discuss "what we ought to value" you need to judge moral systems and their consequences using something that is both vaguer and more practical than a moral system. Such as psychology, or sociology, or political expedience, or some combination of these.
I think this is tempting but ultimately misguided, because the choice of a 'more practical and vague' system by which to judge moral systems is itself just a second-order moral system that happens to be practical and vague. This is metanormative regress.
The only coherent solution to the "ought-from-is" problem I've come across is normative eliminativism - 'ought' statements are either false or a special type of descriptive statement.
Evolutionary ethics aims to help people understand why we value the things we do. It doesn't have the ability to say anything about what we ought to value.
What's the state of existing empirical evidence on whether Moral Reasoning is Real?
My own observations tell me that it is not. Certainly, some people engage in moral reasoning and are satisfied with their results to varying degrees, but it appears to me that this is a small proportion of humans.
My preliminary investigation into the research confirms my existing belief that most moral reasoning is post hoc, and that while human values can change, it is almost never due to reasoned argument and is instead a social and emotional process. When moral reasoning does seem to work, endorsement is often shallow and attitudes can revert within days.
I am frequently reminded that I underestimate the degree to which my own view on this is not universally held, however.
Is Goodhart's Curse Not Really That Bad?
EDIT: It's bad. Still, it's good to understand exactly when it's bad.
I'm not implying I'm on to anything others haven't thought of by posting this - I'm asking this so people can tell me if I'm wrong.
Goodhart's Curse is often cited to claim that if a superintelligent AI has a utility function which is a noisy approximation of the intended utility function, the expected proxy error will blow up given a large search space for the optimal policy.
But, assuming Gaussian or sub-Gaussian error, the expected regret is actually something like σ√(2 log n), where n is the size of the raw search space. Even if the search space grows exponentially with intelligence, log n grows only linearly, so the expected error isn't really blowing up. And if smarter agents make more accurate proxies, σ shrinks, so the error might very plausibly decrease as intelligence grows.
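A quick sanity check of that formula, under the simplifying assumption that the proxy error on each of the n candidate policies is an independent N(0, σ²) draw, so the worst-case damage from optimizing the proxy is bounded by the largest single error:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, reps = 1.0, 50

for n in [10**2, 10**4, 10**6]:
    # Monte Carlo estimate of E[max of n i.i.d. N(0, sigma^2) proxy errors].
    simulated = np.mean([rng.normal(0.0, sigma, n).max() for _ in range(reps)])
    bound = sigma * np.sqrt(2 * np.log(n))
    print(f"n = {n:>9,}   E[max error] ~ {simulated:.2f}   bound = {bound:.2f}")
```

The simulated expectation stays below the σ√(2 log n) bound, and multiplying the search space by 10,000 adds only a little over 2σ to the expected worst-case error.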
I understand that there are a lot of big assumptions here which might not hold in practice, but this still seems to suggest there are a lot of worlds where Goodhart's Curse doesn't bite that hard.
If this is too compressed to be legible, please let me know and I will make it a full post.