How should a corrigible agent behave when its principal seems self-contradictory? (Either because the principal is a team, or simply because the single-human principal is inconsistent.)
This, I think, is the single largest problem with trying to make a purely corrigible (as opposed to broadly deferential) agent. Humans are generally inconsistent (and, as a corollary, the observed subset of their behavior may turn out to be inconsistent with their other preferences even if it is internally consistent). That said, I am not a specialist in any of this.
I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
My mind just runs around screaming "harmonic serialism". (For an intro, see: McCarthy, John J. (2000). Harmonic serialism and parallelism. In Masako Hirotani, Andries Coetzee, Nancy Hall, and Ji-yung Kim (eds.), Proceedings of the North East Linguistic Society 30. Amherst: GLSA, pp. 501-524. [ROA-357] Does presuppose some linguistic familiarity, so ask away.)
It’s kinda irrelevant whether the training examples are sufficient to teach the system the true distinction between bombs and non-bombs; you can just have an agent which errs extremely hard on the side of sensitivity (at the cost of specificity) and gradually learns to whitelist some things.
...Known as "the trivial recall-maximizing model" in the extreme, this is, I think, one of the things you very quickly learn not to hope for.
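To make the sensitivity/specificity trade-off concrete, here is a minimal sketch of that extreme case. The toy labels and the helper names (`flag_everything`, `recall_and_specificity`) are my own assumptions for illustration, not anything from the quoted post.

```python
def recall_and_specificity(y_true, y_pred):
    """Return (recall, specificity) for boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # bombs flagged
    fn = sum(1 for t, p in pairs if t and not p)      # bombs missed
    tn = sum(1 for t, p in pairs if not t and not p)  # non-bombs passed
    fp = sum(1 for t, p in pairs if not t and p)      # non-bombs flagged
    recall = tp / (tp + fn)        # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    return recall, specificity


def flag_everything(items):
    """Hypothetical 'trivial recall-maximizing model': everything is a bomb."""
    return [True for _ in items]


# Toy data (assumed): 2 bombs among 10 objects.
labels = [True, False, False, False, True, False, False, False, False, False]
preds = flag_everything(labels)
recall, specificity = recall_and_specificity(labels, preds)
print(f"recall={recall}, specificity={specificity}")
```

The model never misses a bomb, but only because it never lets anything through. All the real work is deferred to the "gradually learns to whitelist" step.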
This desideratum is called “behaviorism,” but even B. F. Skinner (probably) would’ve admitted that sometimes an animal is “seeking food” or “seeking shelter,” which, to be blunt, is definitely modeling the animal’s mind, even if it’s couched in language of behavior. I’m not convinced any (normal intelligence) humans are (or ever have been) behaviorists in the way Yudkowsky uses the word, and I leave it to him to argue that this is possible.
...Yes, they have. Behaviorism is recent enough that there have been echoes in my courses of linguistics and psycholog...
For instance, we can see inclusive genetic fitness as akin to a reward function which updates the human policy, but no human actually optimizes inclusive genetic fitness—instead we seek to maximize things which tended to correlate with fitness in the ancestral environment.
Related: https://www.lesswrong.com/posts/XPErvb8m9FapXCjhA/adaptation-executers-not-fitness-maximizers
To a first approximation, I would submit that "a task that requires constant attention" is a natural class while "a task that requires being distracted once in a while" is not.
produces a multi-dimensional response that includes muscle-action as well as changes to thoughts and memories.
Didn't you say earlier that actions should be "distinguished" from thoughts?
This text reminded me that I feel like some people in ethics/alignment would find learning about Optimality Theory in linguistics (and its pitfalls) useful.
That depends on your definition of Utopia. I think that nearly any human society we would recognize would have zero-sum status games, even if (perhaps especially if!) intelligence and resources are high enough not to worry about survival, even though it would be better not to have them. And an entity with a terminal aim of succeeding in such a status game may find it instrumentally beneficial to "pray the gay away".
(Vibes-level related but distinct: by LW standards, I am abnormally skeptical that simple self-modification is a good idea.)
In a literal sense, a corrigible agent is one that can be corrected. But in the context of AI alignment, I believe that the word should mean something stronger than mere correctability.
Can I submit that "correctable" is language misuse? :)
Being a good alignment researcher seems to require a correct understanding of the nature of values. However metaethics is currently an unsolved problem, with all proposed solutions having flawed or inconclusive arguments, and lots of disagreement among philosophers and alignment researchers, therefore the current meta-correct metaethical position seems to be one of confusion and/or uncertainty. In other words, a good alignment researcher (whether human or AI) today should be confused and/or uncertain about the nature of values.
Eh. In any active theoretical...
Some humans attempt to self-modify their core desires for instrumental reasons, but if given enough intelligence and resources they likely wouldn’t persist in having this meta-desire.
I think this is overly optimistic. Some people clearly have a core desire for succeeding for the sake of succeeding, and it may cause this self-modification meta-desire to persist.
I am frankly unconvinced that 75% is something worth celebrating.
Full offense to Health Canada: this is a terrible graphic, because if you don't look at it carefully you will think that the provinces in dark blue have approximately the same number of cases, and this is very false.
Yeah, who even does that?
I am experienced enough in text-based RP, and have interacted with Character.AI enough, to confidently assert that LLMs are not categorically different in their output from a poor-memory RPer, despite sometimes clearly different underlying patterns.
If the LLM text contains surprising stuff, and you DID thoroughly investigate for yourself, then you obviously can write something much better and more interesting.
This is false. Dressing up text to be readable is a separate skill not everyone has.
This raises an important problem. Though I have to admit my gut reaction is "neurotypicals are being weird again" :)
Damn. And I tried the strategy "what if I try to predict it only off the text, without looking at csv" :D
Why DEX though? Like, conceptually it's absolutely unpredictable, this is one of the most useful scores in most TTRPGs.
But you define policy as a mapping between contexts and actions, not as a mapping between contexts and actions+thoughts.