Oops. Then I don't get what techniques you are proposing. Like, most techniques that claim to work for superintelligence / powerful agents also claim to work, in some more limited manner, for current agents (in part bc most techniques assume that no phase change occurs between now and then, or that the phase change doesn't affect the technique, so any failure of the technique shows up gradually and one can do empirical studies on current models).
And while there certainly is some loss function or initial random seed for current techniques that gives you aligned superintelligence, there's no way to find them.
By pseudo-kindness I mean any proxy for kindness that's both wrong enough to have no overlap with kindness when optimized for by a superintelligence, and right enough to overlap with kindness when optimized for by current LLMs.
Kindness is some property that behavior & consequences can exhibit. There are many properties in general, and there are still many that correlate strongly with kindness on a narrow test environment. Some of these proxy properties are algorithmically simple (and thus plausibly found in LLMs and thus again in superintelligence), some even share subcomputations/subdefinitions with kindness. There's a degrees-of-freedom argument about how many such proxies there are. Concretely one can give examples, e.g. "if asked, the user rates the assistant's texts as kind" is a proxy that correlates well with the assistant's plans being kind / having kind consequences.
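As a toy illustration of how such a proxy can correlate strongly on a narrow distribution and still come apart under strong optimization, here's a minimal numeric sketch (the feature model and the particular proxy are made-up assumptions for illustration, not a claim about how LLMs actually represent kindness):

```python
# Toy sketch: a proxy that correlates with "kindness" on a narrow
# distribution can anti-correlate with it once optimized hard over a
# wider space. All features and functions are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def true_kindness(x):
    # feature 0 is something like flattery: a little is fine,
    # a lot is manipulative and actively unkind
    return x[..., 1:].sum(axis=-1) - 10.0 * np.maximum(x[..., 0] - 0.5, 0.0)

def proxy_kindness(x):
    # "if asked, the user rates the text as kind": rewards flattery at any level
    return x.sum(axis=-1)

# Narrow "training" distribution: flattery never gets extreme,
# so the proxy and the true property move together
narrow = rng.normal(0.0, 0.1, size=(10_000, 8))
corr = np.corrcoef(true_kindness(narrow), proxy_kindness(narrow))[0, 1]
print(f"correlation on the narrow distribution: {corr:.2f}")  # high, ~0.93

# A much stronger optimizer searching a much wider space of plans
wide = rng.normal(0.0, 5.0, size=(100_000, 8))
best = wide[np.argmax(proxy_kindness(wide))]
print("true kindness of the proxy-optimal plan:", float(true_kindness(best)))
print("true kindness of the best actual plan:  ", float(true_kindness(wide).max()))
```

The point is just that selecting hard on the proxy over a wide enough space of plans lands on plans the proxy loves and the true property doesn't.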
Wrt corrigibility: I don't see why corrigibility doesn't have the same problems as kindness. It's a less complex and less human-centric concept than kindness, but still complex and plausibly human-centric (e.g. "do what I mean" style logic or "human-style counterfactuals"). Plausibly it's also not human-centric, or not much so, i.e. a wide class of agents would converge on the same concept of corrigibility rather than different versions.
Proxies of corrigibility during training still exist, and tails still come apart.
Empirically, current LLM behavior is better predicted by a model in which kind-looking outputs come from reflexes towards pseudo-kindness plus instrumental reasoning about kindness, than by a model in which they come from reflexes towards true kindness.
The second model, under capability growth, indeed can yield a capable reasoner steered by reflexes towards approximately true kindness. And if we get enough training before ASI, the approximation can become good enough that, due to discreteness or attractors, it just equals true kindness.
The first model just generalizes to a capable misaligned reasoner.
I expect that all processes that promote kind-looking outputs route either through reflexes towards pseudo-kindness, or through instrumental reasoning about pseudo-kindness and kindness. Reflexes towards true kindness are just very complex to implement in any neural net, and so unlikely to spontaneously form during training, since there are so many alternative pseudo-kindness reflexes one could get instead. Humans stumbled into what we call kindness somehow, partially due to quirks of evolution vs SGD, like genome size or the need for cooperation between small tribes, etc. Now new humans acquire similar reflexes towards similar kindness due to their shared genes, culture and environment.
Reinforcing kind-looking outputs in AI just reinforces those reasoning processes and reflexes towards pseudo-kindness. The instrumental reasoning towards true kindness performs quite robustly, while reflexes or reasoning towards pseudo-kindness may lead to not-kind-looking outputs even during training if the data distribution shifts a bit. Still, there are enough versions of pseudo-kindness that even this kind of robustness doesn't narrow things down to true kindness.
Both reflexes towards pseudo-kindness and reasoning about true/pseudo-kindness, however, don't generalize the way we want once the AI's environment shifts, e.g. because a treacherous turn becomes possible, or the AI's world model grows a lot larger, or various other effects that happen on the way to superintelligence.
Pseudo-kindness becomes something orthogonal, i.e. it promotes actions we don't care about (filling the lightcone with computations we no longer view as being even partially about kindness, at most a bad imitation that got crucial details wrong). Instrumental reasoning about kindness just ceases to happen once the instrumental reasons no longer apply, e.g. bc the AI can now pursue plans regardless of human approval due to deception / an anticipated takeover.
My unconfident best guess after skimming this post (sry) is that you implicitly assumed that reflexes towards true kindness are available for reinforcement.
It becomes a bit more like logical inductors.
If logical inductors are what one wants, just do that.
"a reasonable time-penalty"
I'm not entirely sure, but I suspect that I don't want any time penalty in my (typical human) prior. E.g. even if quantum mechanics takes non-polynomial time to simulate, I still consider it a likely hypothesis. A time penalty just doesn't seem related to what I pay attention to when I consult my prior for the laws of physics / fundamental hypotheses. There are also many other ideas for augmenting a simplicity prior that fail similar tests.
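For concreteness, one standard way to formalize a time penalty is Levin-style (this exact form is my assumption about what's meant): a plain simplicity prior weights a hypothesis $h$ as $P(h) \propto 2^{-\ell(h)}$, with $\ell(h)$ the length of its shortest program, while a time-penalized prior weights it as $P(h) \propto 2^{-(\ell(h) + \log_2 t(h))}$, with $t(h)$ that program's runtime. So a hypothesis like QM, whose program is short but whose simulation of $n$ particles takes $t \approx 2^n$ steps, loses about $n$ bits of prior mass under the second prior and none under the first; that extra term is exactly the thing my intuitive prior doesn't seem to contain.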
Concretely, I guess current tech can get a message out to a few targets at 10^3 to 10^6 light years' distance. An ASI can use many physical probes at near light speed, accelerated using energy from a Dyson swarm, so I'd guess those arrive only a few years behind. I don't expect there to be aliens within 10^6 lys, nor do we know which stars to target, and it's again unlikely that they happen to be in the thin window of technological development where a warning message from us helps them.
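To make the lag concrete (the probe speed here is a made-up number, just arithmetic): a probe at speed $\beta c$ arrives $\Delta t = \frac{d}{c}\left(\frac{1}{\beta} - 1\right)$ behind a light-speed message, so for $d = 10^3$ ly and $\beta = 0.999$ that's about a year, for $\beta = 0.99$ about a decade, and the lag scales linearly with distance.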
There already are optional expiration dates, and I use them a lot whenever I can guess the next time new information may become available that I'd want to have a chance to react to.
Half of the time remaining is a good guess for at least about a third of my bids, I'd say(?). So this would speed up bidding somewhat for me, would likely get a lot of people who don't use expiration dates to use them, and would save their skin.
For the IMO markets one problem would be that the markets were set up to close at EOY, not during the IMO. I do not know why. So for anyone who bid in 2025 the default proposal would've been too late regardless.
I recently collected a large reaction-time premium on the various IMO gold Manifold markets bc I coincidentally saw OpenAI's announcement on Twitter only 10 min late.
Collecting the premium involved filling various limit orders that I confidently guess were outdated on a long timescale, i.e. they would've been removed a week before the IMO if the owner had bothered to think about them at that time, in addition to being outdated on a short timescale, i.e. they would've been removed after the OpenAI announcement if only the owner had as fast a reaction time as I lucked into.
I expect that for long-running markets, 24h batched auctions would increase liquidity, and thus accuracy, bc more aggressive limit orders / price-volume tails would get specified when there's less risk of forgetting to cancel a limit order before the information landscape changes. I.e. right now old limit orders often contain less intelligence and information than new orders bc of reaction times, so aggregating all the information should "weigh" them less in a smart way (i.e. take older orders as weaker Bayesian evidence). Often I think we don't care about super fast aggregation, and care more that traders get higher incentives to provide aggregate-able evidence, e.g. by specifying their confidence curves with less friction from platform UX and from adversarial selection in one direction (i.e. if you specify two tails to capture volatility, one gets bought up entirely if you react slower than others to new information). Which is why I expect that batched auctions would lower that friction, and lead to higher liquidity and more accurate forecasts, at merely lower time resolution.
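For concreteness, here's a minimal sketch of what I mean by a batched auction: collect all limit orders over the 24h window and clear them at a single uniform price. The order representation and the tie-breaking rule are illustrative assumptions, not any platform's actual mechanism:

```python
# Minimal sketch of a uniform-price batch (call) auction, one way to
# implement the 24h batching idea. Names and the exact clearing rule
# are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Order:
    price: float   # limit price, e.g. a probability in [0, 1]
    size: float    # number of shares

def clear_batch(bids: list[Order], asks: list[Order]) -> tuple[float | None, float]:
    """Return (clearing_price, matched_volume) for one batch.

    Bids are buy orders (willing to pay up to `price`), asks are sell
    orders (willing to accept at least `price`). Orders are matched
    while the best remaining bid crosses the best remaining ask; the
    clearing price is the midpoint of the last matched quotes.
    """
    bids = sorted(bids, key=lambda o: -o.price)   # highest bid first
    asks = sorted(asks, key=lambda o: o.price)    # lowest ask first

    matched = 0.0
    bi = ai = 0
    bid_left = bids[0].size if bids else 0.0
    ask_left = asks[0].size if asks else 0.0
    last_bid = last_ask = None

    while bi < len(bids) and ai < len(asks) and bids[bi].price >= asks[ai].price:
        qty = min(bid_left, ask_left)
        matched += qty
        bid_left -= qty
        ask_left -= qty
        last_bid, last_ask = bids[bi].price, asks[ai].price
        if bid_left == 0:
            bi += 1
            bid_left = bids[bi].size if bi < len(bids) else 0.0
        if ask_left == 0:
            ai += 1
            ask_left = asks[ai].size if ai < len(asks) else 0.0

    if matched == 0:
        return None, 0.0
    return (last_bid + last_ask) / 2, matched

# Example: all orders collected over the 24h window clear at one price.
price, volume = clear_batch(
    bids=[Order(0.70, 100), Order(0.60, 50)],
    asks=[Order(0.55, 80), Order(0.65, 80)],
)
print(price, volume)
```

Since a stale quote is only exposed once per clearing instead of continuously, leaving aggressive tails in the book becomes less risky.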
I'm honestly surprised that other people haven't covered this question on LW before, since it feels very central to the space of questions LW folks tend to be interested in.
Your core argument was familiar to me, but I don't recall immediately if or where I encountered it on LW before. I strongly associate it with the Sequences on induction and Occam's razor, and with the List of Lethalities, but I doubt that they mention an anthropic filter argument about what environments minds exist in; probably it's just that I read those documents around the time I contemplated Wigner's puzzle myself.
Kindness may also have an attractor, or, due to discreteness, have a volume > 0 in weight space.
The question is whether the attractor is big enough. And given that there are various impossibility theorems related to corrigibility & coherence, I anticipate that the attractor around corrigibility is quite small, bc one has to evade various obstacles at once. Otoh, proxies that flow into a non-corrigible location once we ramp up intelligence aren't obstructed by the same theorems, so they can be just as numerous as proxies for kindness.
Wrt your concrete attractor: if the AI doesn't improve its world model and decisions, aka its intelligence, then it's also not useful to us. And a human in the loop doesn't help if the AI's proposals are inscrutable to us, bc then we'll just wave them through and are essentially not in the loop anymore. A corrigible AI can be trusted with improving its intelligence bc it only does so in ways that preserve its corrigibility.