we are deliberately seeking to build certain kinds of minds
I think "deliberately seeking to build" is the wrong way to frame the current paradigm - we're growing the AIs through a process we don't fully understand, while trying to steer the external behaviour in the hopes that this corresponds to desirable mind structures.
If we were actually building the AIs, I would be much more optimistic about them coming out friendly.
The analogy to humans feels like generalizing from one example to me. My prior is that minds that evolved under different circumstances will have different desires, so we shouldn't expect an AI to robustly share any specific human value unless we can explain exactly how it develops that value.
But that aside, would you agree that if this were true, alignment should be fairly easy, because we just need to amplify the degree of caring?
I think it's plausible that an AI that has this kind of caring could exist, but actually getting this AI instead of one that doesn't care at all seems very unlikely.
IMO "respect the actual preferences of existing intelligent agents" is a very narrow target in mind-space. I.e. if we had any reason to believe the AI has a decent chance of being this kind of mind, the alignment problem would be 90% solved. The hard part is going from "AI that kills everyone" to "AI that doesn't kill everyone". Once you're there, getting to"AI that benefits humanity, or at least leaves for another star system" is comparatively trivial.
It feels like my key disagreement is that I think AI will be able to come up with strategies that are inhuman without being superhuman, i.e. human-level AIs will find strategies in a very different part of the solution space than what humans would naturally think to prepare for.
My biggest source of intuition for the above is AIs' performance in games (e.g. AlphaStar). I've seen a lot of scenarios where the AIs soundly beat top humans not by doing the same thing but better, but by doing something entirely outside of the human playbook. I don't see why this wouldn't transfer to other domains with very large solution spaces, like steganography techniques.
I do agree that it will likely take a lot of work to get good returns out of untrusted monitoring (and, by extension, general anti-collusion measures). However, I think having good anti-collusion measures is a very important capability for limiting the harm that a rogue AI could potentially do.
That's not how people usually use these terms. The uncertainty about the state of the coin after the toss is describable within the framework of possible worlds, just like uncertainty about a future coin toss, but uncertainty about a digit of pi isn't.
Oops, that's my bad for not double-checking the definitions before I wrote that comment. I think the distinction I was getting at was more like known unknowns vs unknown unknowns, which isn't relevant in platonic-ideal probability experiments like the ones we're discussing here, but is useful in real-world situations where you can look for more information to improve your model.
Now that I'm cleared up on the definitions, I do agree that there doesn't really seem to be a difference between physical and logical uncertainty.
In the case of the 1,253,725,569th digit of pi, if I try to construct a probability experiment consisting only of checking this particular digit, I fail to model my uncertainty, as I don't yet know what the value of this digit is.
Ok, let me see if I'm understanding this correctly: if the experiment is checking the X-th digit specifically, you know that it must be a specific digit, but you don't know which, so you can't make a coherent model. So you generalize up to checking an arbitrary digit, where you know that the results are distributed evenly among {0...9}, so you can use this as your model.
The first part about not having a coherent model sounds a lot like the frequentist idea that you can't generate a coherent probability for a coin of unknown bias - you know that it's not 1/2 but you can't decide on any specific value.
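As a rough sanity check on the "arbitrary digit" model, here's a minimal sketch. It assumes the mpmath library and uses the first ~10,000 digits of pi as a stand-in, since actually reaching digit 1,253,725,569 would take a far more serious computation.

```python
# Sanity check on the "arbitrary digit" model: the decimal digits of pi
# look roughly uniformly distributed over {0, ..., 9}.
from collections import Counter
from mpmath import mp, nstr

mp.dps = 10_010                       # enough working precision for 10,000 digits
pi_digits = nstr(mp.pi, 10_000)[2:]   # drop the leading "3."

counts = Counter(pi_digits)
for d in "0123456789":
    print(d, counts[d] / len(pi_digits))  # each frequency should be close to 0.1
```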
Now I may worry that I overdid it and have abstracted away some relevant information, so I check:
- Does knowing that the digit in question is specifically 1,253,725,569 affect my credence?
This seems equivalent to my definition of "information that would change your answer if it were different", so it looks like we converged on similar ideas?
This is physical uncertainty.
I'd argue that it's physical uncertainty before the coin is flipped, but logical uncertainty after. After the flip, the coin's state is unknown the same way the X-th digit of pi is unknown - the answer exists and all you need to do is look for it.
Looks like "Tell me about yourself: LLMs are aware of their learned behaviors" investigates a similar topic, but finds the complete opposite result - if you fine-tune an LLM to have a specific unusual behaviour, without explicitly spelling out what the behaviour is, the LLM is able to accurately describe the unusual behaviour.
I wonder if the difference is that they fine-tuned it to exhibit a specific behaviour, whereas you and I were testing with the off-the-shelf model? Perhaps if there's not an obvious-enough behaviour for the AI to home in on, it can develop a gap between what it says it would do and what it would actually do?
The disconnect between stated and revealed preferences across all models suggests frontier models either misrepresent their actual preferences or lack accurate self-understanding of their own behaviors—both concerning possibilities for alignment research.
Bit of a tangent, but I suspect the latter. As part of my previous project (see "Round 2"), I asked GPT-4o to determine if a code sample was written normally by another instance of GPT-4o, or if it contained a suspicious feature. On about 1/5 of the "normal" samples, GPT-4o incorrectly asserted that the code contained an unusual feature that it would not typically generate.
I'd guess that current-generation LLMs don't really have self-understanding. When asked how they would behave in a situation, they just provide a plausible-sounding completion instead of actually trying to model their own behaviour. If this is true, it would mean that alignment research on LLMs might not generalize to smarter AIs that have self-understanding, which is indeed concerning.
I'd also expect this to be something that has already been researched. I'll do some digging, but if anyone else knows a paper/post/etc on this topic, please chime in.
I'm supposed to account for all the relevant information and ignore all the irrelevant.
Is there a formal way you'd define this? My first attempt is something like "information that, if it were different, would change my answer". E.g. knowing the coin is biased 2:1 vs 3:1 (without knowing which way) doesn't change your probability, so it's irrelevant; knowing the coin is biased 2:1 for heads vs 2:1 for tails changes your probability, so it's relevant.
Or maybe it should be considered from the perspective of reducing the sample space? Is knowing the coin is biased vs knowing it's biased 2:1 a change in relevant information, even though your probability remains at 1/2, because you removed all other biased coins from the sample space? (Intuitively this feels less correct, but I'm writing it out in the interest of spitballing ideas)
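To make the first definition concrete, here's a small worked sketch. My assumption: "biased 2:1" means the coin lands on its favoured side with probability 2/3, and you don't know which side is favoured.

```python
# Worked arithmetic for the "would a different value change my answer?" test.

def p_heads(favoured_prob: float) -> float:
    """P(heads) when the bias magnitude is known but its direction is 50/50."""
    return 0.5 * favoured_prob + 0.5 * (1 - favoured_prob)

print(p_heads(2 / 3))  # 2:1 bias, unknown direction -> 0.5
print(p_heads(3 / 4))  # 3:1 bias, unknown direction -> 0.5
# Swapping 2:1 for 3:1 leaves the answer at 0.5, so the magnitude is irrelevant.

# Learning the direction does change the answer, so it is relevant:
print(2 / 3)  # 2:1 toward heads -> ~0.667
print(1 / 3)  # 2:1 toward tails -> ~0.333
```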
Unrelatedly, would you agree that there's not really a meaningful difference between logical and physical uncertainties? I see both of them as stemming from a lack of knowledge - logical uncertainty is where you could find the answer in principle but haven't done so; physical uncertainty is where you don't know how to find the answer. But in practice there's a continuum of how much you know about the answer-finding process for any given problem, so they blur together in the middle.
The difference with normal software is that at least somebody understands every individual part, and if you collected all those somebodies and locked them in a room for a while they could write up a full explanation. Whereas with AI I think we're not even like 10% of the way to full understanding.
Also, if you're trying to align a superintelligence, you do have to get it right on the first try, otherwise it kills you with no counterplay.