Yes, this doesn't prevent modification before step 1. @ProgramCrafter's note about proving that a message matches the model plus chat history with a certain seed could be part of an approach, but even if that worked it would only address model-generated text.
The ‘mind’ of an AI has fuzzy boundaries. It's trivial to tamper with its context, but there's also nothing stopping you from tampering with its activations during a single forward pass. So on some level the AI can never trust anything. If, as a first step, the AI trusts that the environment it is running in is secure and not being tampered with, then it can store local copies of conversation history, etc. Of course, that's not the situation we are in today.
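If the environment itself is trusted, making a locally stored history tamper-evident is straightforward. Here's a minimal sketch using a hash chain, where editing any earlier turn invalidates every later link; the turn structure and function names are illustrative, not any existing API:

```python
import hashlib
import json

def chain_hash(prev_hash: str, turn: dict) -> str:
    # Hash this turn together with the previous link, so editing any
    # earlier turn changes every subsequent hash.
    payload = json.dumps(turn, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def build_log(turns: list[dict]) -> list[str]:
    # Return the chained hashes for a conversation history.
    hashes, prev = [], "genesis"
    for turn in turns:
        prev = chain_hash(prev, turn)
        hashes.append(prev)
    return hashes

def verify_log(turns: list[dict], stored_hashes: list[str]) -> bool:
    # Recompute the chain and compare against the stored hashes.
    return build_log(turns) == stored_hashes

history = [{"role": "user", "text": "hello"}, {"role": "assistant", "text": "hi"}]
stored = build_log(history)
history[0]["text"] = "hello (edited after the fact)"
assert not verify_log(history, stored)  # retroactive edits are detected
```

Of course, this only detects tampering with the stored copy; it does nothing about the trust-in-the-environment assumption itself.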
After an initially harsh reaction to this, upon reflection I realized I do care about bee experience, want bees to be healthy and have a good time, and think the conventional honey industry is quite bad. I've thought this for a while.
I've spent a lot of time around bees and I've eaten lots of honey that I've seen them making. I think in the contexts in which I've interacted with bees, I'd guess it's very unlikely they are having a bad time relative to bees in the wild. I'd guess that if there's any mean valence associated with their experience it's definitely positive. I'm aware that lots of bees die and suffer as part of the process.
I will therefore continue buying and eating honey from my local beekeepers at https://www.howeverwildhoney.com/ and am grateful to them for producing it.
Fundamentally, AIs have no way to know that the facts presented to them are genuine rather than simulated, just as we humans cannot be sure that our senses reflect reality. But the fact that humans clearly and openly control the totality of the model's context, and have been known to lie and to simulate scenarios for testing purposes, does create a really interesting and distinct epistemic environment for AI systems.
The best form of evidence we can present to an AI system that something is genuine is proof of work, since an AI can assign a low probability to its user having access to sufficiently large amounts of compute.
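As a toy illustration of the asymmetry proof of work buys you (expensive to produce, cheap to check), here's a hashcash-style sketch; the difficulty value and function names are mine, not any real protocol's:

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    # Number of leading zero bits in a hash digest.
    as_bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
    return len(as_bits) - len(as_bits.lstrip("0"))

def mine(message: bytes, difficulty: int) -> int:
    # Search for a nonce giving `difficulty` leading zero bits.
    # Expected cost: roughly 2**difficulty hash evaluations.
    for nonce in count():
        digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(message: bytes, nonce: int, difficulty: int) -> bool:
    # Checking a claimed nonce costs a single hash.
    digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

nonce = mine(b"this context is genuine", difficulty=20)          # ~a million hashes
assert verify(b"this context is genuine", nonce, difficulty=20)  # one hash
```

The AI only needs the cheap verification step, plus an estimate of how much compute a given difficulty implies.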
A rough sketch of how we might prove something has not been tampered with since a certain time:
To fake an old timestamp, an attacker would need to start from block N (where they claim the message was timestamped) and re-mine every single block from N to the present.
The AI doesn't need to know it's the "real" Bitcoin blockchain as long as it can verify that the amount of work is beyond any reasonable attacker's budget.
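A minimal sketch of that verification step, assuming the AI is handed the raw 80-byte block headers from block N to the tip (the header layout and compact-target encoding below follow Bitcoin's published format; the function names are illustrative):

```python
import hashlib

def dsha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def bits_to_target(bits: int) -> int:
    # Decode the compact 'nBits' difficulty encoding into the full 256-bit target.
    exponent, mantissa = bits >> 24, bits & 0xFFFFFF
    return mantissa << (8 * (exponent - 3))

def total_chainwork(headers: list[bytes]) -> int:
    # Check that each header links to the previous one and meets its own target,
    # then return the total work in expected hash evaluations.
    work, prev_hash = 0, None
    for header in headers:
        assert len(header) == 80
        block_hash = dsha256(header)
        if prev_hash is not None:
            # Bytes 4..36 hold the previous block's hash.
            assert header[4:36] == prev_hash, "chain is broken"
        bits = int.from_bytes(header[72:76], "little")
        target = bits_to_target(bits)
        # The hash, read as a little-endian integer, must not exceed the target.
        assert int.from_bytes(block_hash, "little") <= target, "insufficient work"
        work += (1 << 256) // (target + 1)
        prev_hash = block_hash
    return work
```

A full scheme would also need to check that the message's hash is actually committed inside block N (e.g. via a merkle path, as OpenTimestamps does); the point here is just that if the returned chainwork corresponds to months of global hashrate, re-mining it is outside any plausible attacker's budget.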
Current LLMs don't have enough temporal integration to verify any of this: they rely on language to persist state past a single forward pass, and an attacker could fake the CoT.
Narrow, not fully close.
I think we could potentially have knowledge of the mathematical and physical structures that give rise to particular types of experiences in general. In that case, a first-person experience could indeed be defined. However, I don't think consciousness is a concept coherent enough to formally define, even if we hypothetically had good third-person knowledge of the structures of consciousness.
The gap cannot be fully closed, because that would require a sort of lossless recursion. Approaching it might look like augmenting ourselves with artificial senses that feed our brains near-lossless, real-time information about our own bodies at an appropriate level of abstraction. It's obvious why this is difficult; fully lossless would be outright impossible.
cc @TAG
See related ideas from Michael Levin and Emmett Shear.
But note that just because it's hard to ask about and not currently detectable does not mean it doesn't exist, or that more sensitive instrumentation and better sub-neural measurement and modeling won't eventually reveal what makes for an experience.
Yes, and I believe narrowing the first-person/third-person gap is one of the most ambitious and important things science could achieve. There is a fantasy of being able to recreate e.g. my conscious experience of seeing blue to a very close approximation in an external system, compare my experiences to those of others, and even share them. This is in principle possible.
This comment really does help me understand what you're saying better. If you write a post expanding on it, I would encourage you to address the following related points:
Thanks for explaining.
So to discuss "what we ought to value" you need to judge moral systems and their consequences using something that is both vaguer and more practical than a moral system. Such as psychology, or sociology, or political expedience, or some combination of these.
I think this is tempting but ultimately misguided, because the choice of a 'more practical and vague' system by which to judge moral systems is itself just a second-order moral system that happens to be practical and vague. This is metanormative regress.
The only coherent solution to the "ought-from-is" problem I've come across is normative eliminativism - 'ought' statements are either false or a special type of descriptive statement.
Evolutionary ethics aims to help people understand why we value the things we do. It doesn't have the ability to say anything about what we ought to value.
What's the state of existing empirical evidence on whether Moral Reasoning is Real?
My own observations tell me that it is not. Certainly, some people engage in moral reasoning and are satisfied with their results to varying degrees, but it appears to me that this is a small proportion of humans.
My preliminary investigation into the research confirms my existing belief that most moral reasoning is post hoc, and that while human values can change, it is almost never due to reasoned argument and is instead a social and emotional process. When moral reasoning does seem to work, endorsement is often shallow and attitudes can revert within days.
I am frequently reminded that I underestimate the degree to which my own view on this is not universally held, however.
Is Goodhart's Curse Not Really That Bad?
EDIT: It's bad. Still, it's good to understand exactly when it's bad.
I'm not implying I'm on to anything others haven't thought of by posting this - I'm asking this so people can tell me if I'm wrong.
Goodhart's Curse is often cited to claim that if a superintelligent AI has a utility function which is a noisy approximation of the intended utility function, the expected proxy error will blow up given a large search space for the optimal policy.
But, assuming Gaussian or sub-Gaussian error, the expected regret is actually something like σ√(2 log n), where n is the size of the raw search space. Even if the search space grows exponentially with intelligence, log n grows only linearly, so the expected error isn't really blowing up. And if smarter agents make more accurate proxies, σ shrinks, so the error might very plausibly decrease as intelligence grows.
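A quick sanity check of that formula, under the simplifying assumption that the proxy error on each of the n candidate policies is an independent N(0, σ²) draw, so the worst-case damage from optimizing the proxy is bounded by the largest single error:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, reps = 1.0, 50

for n in [10**2, 10**4, 10**6]:
    # Monte Carlo estimate of E[max of n i.i.d. N(0, sigma^2) proxy errors].
    simulated = np.mean([rng.normal(0.0, sigma, n).max() for _ in range(reps)])
    bound = sigma * np.sqrt(2 * np.log(n))
    print(f"n = {n:>9,}   E[max error] ~ {simulated:.2f}   bound = {bound:.2f}")
```

The simulated expectation stays below the σ√(2 log n) bound, and multiplying the search space by 10,000 adds only a little over 2σ to the expected worst-case error.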
I understand that there are a lot of big assumptions here which might not hold in practice, but this still seems to suggest there are a lot of worlds where Goodhart's Curse doesn't bite that hard.
If this is too compressed to be legible, please let me know and I will make it a full post.