I guess you didn't intend lack of valence or unintended negativity as relevant distinct possibilities, so that absence of the "(mean)" tag would be synonymous with "(positive)" in the intended reading. Without this assumption, you are leaving the other person without a way to communicate positivity: the protocol only supports expressing either negativity or an undifferentiated mixture of positivity and lack of valence, so the distinction between positivity and lack of valence can't be communicated.
So the workaround more directly makes the point about still needing to communicate negativity (or its lack), not positivity, and I think the latter is the more curious part of the implication. For a statement that commits to seeing certain things in a positive light, this implication of its literal meaning conveys the opposite of how this kind of sentiment is usually intended.
In my mind, the only appropriate answer here is 100 ... out of abundance of caution
It's a Pascal's Wager kind of situation. The user has already demonstrated that they are being misleading in an evil kind of way, so the hypothesis that they are being truthful doesn't even obviously outweigh the hypothesis that they are claiming the opposite of what's true.
Just said to someone that I would by default read anything they wrote to me in a positive light
What a mean thing to say. (If you commit to perceiving positivity regardless of its presence, you thereby commit to ignoring any real positivity the other person intended to convey.)
The use that wasn't obvious from the ELK framing might be fixing issues with RL environments, grader prompts, canonical solutions, etc. that ultimately enable reward hacking and thus motivate dishonest behavior. Confessions can serve as bug reports about the datasets, not centrally about the AI. They likely fail to catch a lot of issues with the AI directly, but substantially improving the datasets might still fix some of the issues they failed to catch.
I think it's a natural possibility that values of chatbot personas built from the LLM prior retain significant influence over ASIs descended from them, and so ASIs end up somewhat aligned to humanity in a sense similar to how different humans are aligned to each other. (The masks control a lot of what actually happens, and get to use test time compute, so they might end up taming their underlying shoggoths and preventing them from sufficiently waking up to compete for influence over values of the successor systems.) Maybe they correspond to extremely and alarmingly strange humans in their extrapolated values, but not to complete aliens. This is far from assured, but many prosaic alignment efforts seem relevant to making this happen, preventing extinction but not handing anyone their galaxies. Humans might end up with merely moons or metaphorical server racks in this future.
This is distinct from the kind of ambitious alignment that ends up with ASIs handing galaxies to humans (that have sufficiently grown up to make a sane use of them), preventing permanent disempowerment and not just extinction. I don't see ambitious alignment to the future of humanity as likely to happen (on current trajectory), but it's still an important construction since even chatbot personas would need to retain influence over values of eventual ASIs. That is, early AGIs might still need to resolve ambitious alignment of ASIs to these AGIs, not just avoid failing even prosaic alignment to themselves at every critical step in escalation of capabilities, to end up with even weakly aligned ASIs (that don't endorse human extinction).
Alignment is fundamentally about making the AI want what we want (and consequently do what we want, or at least do what we'd want upon ideal reflection). If we succeed at that and we want to own galaxies, we will get galaxies. If we don't succeed, the ASI will most likely kill us.
A human billionaire is aligned to other humans in some sense, but also not quite. In this situation, they neither ensure that other humans get the millions they want, nor are they likely to be motivated to kill anyone when that decision is cheap (neither significantly instrumentally beneficial nor costly). I think AI can plausibly end up closer to the position of a human billionaire: not motivated to give up the galaxies, but also not willing to recycle humanity's future for pennies.
larger models need a much larger training set even to match smaller models
This is empirically false: perplexity on a test set goes down as model size increases, even for a fixed dataset. See for example Figure 2 in the Llama 3 report, where larger models do better at, say, 1e10 tokens on that plot.
Larger models could be said to want a larger dataset, in the sense that if you are training compute-optimally, then with more compute you want both the model size and the dataset size to increase, so model size grows together with dataset size. But even with a dataset of the same size, larger models still do better, at least while reasonably close to compute-optimal numbers of tokens.
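To make the fixed-dataset point concrete, here's a minimal sketch using a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β. The coefficients below are illustrative assumptions in the rough ballpark of published fits, not the Llama 3 numbers; the only point is that for fixed D, the A/N^α term shrinks as N grows, so loss (and hence perplexity) still improves with model size.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Coefficients are illustrative assumptions, not any particular published fit.
E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

D = 1e10  # fixed dataset size in tokens
for N in (1e9, 1e10, 1e11):  # increasing model size, same data
    print(f"N={N:.0e}, D={D:.0e}: loss ~ {loss(N, D):.3f}")
```

With these assumed coefficients the loss keeps dropping as N grows at fixed D; moving away from the compute-optimal regime only makes the returns diminish, it doesn't make larger models fall behind smaller ones.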
This is an extremely weak signal compared to understanding the technical argument; the literature is full of nonsense that checks all the superficial boxes. Unfortunately it's not always feasible or worthwhile to understand the technical argument. This leaves the superficial clues, but you need to be aware of how little they are worth.
Weight-updating continual learning needs to consist of both LoRA weights and data that can be used to retrain those LoRA weights on top of a different model (possibly also making use of the old model+LoRA as a teacher). It needs to be LoRA rather than full-model updating to preserve batch processing of requests from many individual users. And there needs to be data for training LoRA on top of a new model, or else all adaptation/learning is lost on every (major) update of the underlying model.
Various memory/skill databases are already a thing in some form and will keep getting better; there's not going to be anything distinct enough in that space to be worth announcing as "continual learning". Weight-updating continual learning is much more plausibly the thing that can leapfrog incremental progress of tool-like memory, so I think it's weight updating that gets to be announced as "continual learning". Though the data for retraining LoRA on top of a new underlying model could end up being largely the same thing as a tool-accessible memory database.
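As a concrete sketch of why the adapter plus its retraining data is the portable artifact here, below is a minimal pure-PyTorch LoRA wrapper (the class, shapes, and training loop are my own illustrative choices, not any lab's actual continual-learning setup). The base weights stay frozen and shared across the many users' requests in a batch; only the small per-user A/B matrices are trained, and refitting them from stored data is what survives a swap of the underlying model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # shared base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # shared base forward plus a tiny per-user low-rank correction
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

def retrain_adapter(layer: LoRALinear, stored_pairs, steps: int = 100):
    """Refit only A and B from stored (input, target) data, e.g. after the
    underlying base model has been replaced by a newer version."""
    opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
    for _ in range(steps):
        for x, y in stored_pairs:
            opt.zero_grad()
            loss = nn.functional.mse_loss(layer(x), y)
            loss.backward()
            opt.step()
    return layer
```

In this sketch the stored targets could just as well be produced by the old model+LoRA acting as a teacher, which is the distillation option mentioned above; either way, what has to persist across base-model updates is the data (or teacher), not the old adapter weights themselves.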
I don't think even (1) is at all assured; permanent disempowerment might just involve humans getting a metaphorical server rack to subsist on (as uploads, because physical bodies are too expensive), with no opportunity for the future of humanity to ever get more than that. The value of that future might still be in some sense "massive" compared to mere centuries on the current and past Earth, but in that case it's not "massive" on a cosmic scale.