If you want a mundane existence, you can simulate that until you're bored.
My mundane values care about real physical stuff, not simulated stuff.
Yes, I support something like uplifting, as described in other comments in this post.
Yeah, I think we should allow Christian homeschools to exist in the year 3000.
But this cuts against some other moral intuitions, like "people shouldn't be made worse off as a means to an end" (e.g. I don't think we should have wars as a means to inspire poets). And presumably the people in the Christian homeschools are worse off.
Maybe the compromise is something like:
I'm hopeful the details can be fleshed out in late crunch-time.
My best bet for what we should do with the North Sentinelese -- and with everyone post-singularity -- is that we uplift them if we think they would "ideally" want that. And "ideally" is in scare quotes because no one knows what that means.
it seems like the main reason people got less doomy was seeing that other people were working hard on the problem [...]
This would be very surprising to me!
It seems like, to the extent that we're less doomy about survival/flourishing, this isn't because we've seen a surprising amount of effort and concluded that effort is very correlated with success. It's more like: our observations increase our confidence that the problem was easy all along, or that we've been living in a 'lucky' world all along.
I might ask you about this when I see you next -- I didn't attend the workshop so maybe I'm just wrong here.
I disagree that the primary application of safety research is improving refusal calibration. This take seems outdated by ~12 months.
I think labs are incentivised to share safety research even when they don't share capability research. This follows a simple microeconomic model (rough sketch below), but I wouldn't be surprised if the prediction was completely wrong.
Asymmetry between capability and safety: capability advances are mostly positional (sharing one hands your edge to rivals), while safety advances are mostly a public good (everyone benefits from lower risk, whether or not they helped produce it).
What this predicts: labs publish safety work fairly freely, since sharing it costs little competitive edge and buys reputation, but each lab invests less in safety than would be collectively optimal, since it captures only a fraction of the benefit it creates.
The result is an industry that is more coordinated than you might expect (on safety sharing) but less safe than it should be (on safety investment).
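Here's a minimal sketch of the kind of model I have in mind (the payoff structure, the reputation term, and all numbers are illustrative assumptions on my part, not anything labs have stated):

```python
# Minimal sketch: why a profit-motivated lab shares safety results but not
# capability results. All parameter values are made up for illustration.

def net_gain_from_sharing(positional_value: float, public_value: float,
                          reputation: float = 1.0) -> float:
    """Payoff to a lab from publishing a result, relative to keeping it private.

    positional_value: the part of the result's value that comes from being ahead
                      of rivals; publishing hands it to them, so it is lost.
    public_value:     non-rival value the lab enjoys whether or not rivals also
                      have the result (e.g. lower shared catastrophic risk).
    reputation:       hypothetical reputational benefit from publishing.
    """
    return public_value + reputation - positional_value

# Capability result: value is mostly positional, so sharing is a net loss.
print(net_gain_from_sharing(positional_value=10.0, public_value=1.0))  # -8.0 -> keep private

# Safety result: value is mostly public, so sharing is a net gain.
print(net_gain_from_sharing(positional_value=1.0, public_value=5.0))   # 5.0 -> publish
```

The same public-good structure also gives the second half of the prediction: each lab captures only a fraction of the safety benefit it creates, so its privately optimal safety spend sits below the collectively optimal one.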
Can we define Embedded Agent like we define AIXI?
An embedded agent should be able to reason accurately about its own origins. But AIXI-style definitions via argmax create agents that, if they reason correctly about selection processes, should conclude they're vanishingly unlikely to exist.
Consider an agent reasoning: "What kind of process could have produced me?" If the agent is literally the argmax of some simple scoring function, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is physically unrealizable: it requires resources exceeding what's available in the environment. So the agent concludes that it wasn't generated by the argmax.
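To make the "physically unrealizable" step slightly more concrete (the bit-length and the cosmological bound here are my illustrative assumptions, not part of the original argument): enumerating every agent with a $k$-bit description means evaluating $f$ on $2^k$ candidates, and even a modest $k$ blows past rough estimates of the observable universe's total computational capacity (~$10^{120}$ elementary operations, per Lloyd):

$$|X| = 2^{k}, \qquad 2^{1000} \approx 10^{301} \gg 10^{120}.$$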
The following seems like sound reasoning for an embedded agent: "I am a messy physical system in a messy universe, generated by a messy process. It is unlikely that my behavior is a clean mathematical function generated by argmaxing another clean mathematical function."
Yet for an Embedded-AIXI defined via argmax, this reasoning would be fallacious. That's a (very handwavy) obstacle to hoping for an AIXI-style definition of embedded agency.
Another gloss: we can't define what it means for an embedded agent to be "ideal" because embedded agents are messy physical systems, and messy physical systems are never ideal. At most they're "good enough". So we should only hope to define when an embedded agent is good enough. Moreover, such agents must be generated by a physically realistic selection process.
This motivates Wentworth's (mostly abandoned) project of Selection Theorems, i.e. studying physically realistic generators of good enough embedded agents.
By AIXI-style I mean: we have some space of agents X, a real-valued scoring function f on X, and define the ideal agent as the argmax of f.
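Written out, that's just (restating the same definition):

$$\pi^{*} := \operatorname*{arg\,max}_{\pi \in X} f(\pi), \qquad f : X \to \mathbb{R}.$$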
Some thoughts on public outreach and "Were they early because they were good or lucky?"
A bit anecdotal, but: there are ~a dozen people who went to our college in 2017-2020 now working full-time in AI safety, which is far more than at other colleges at the same university. I'm not saying any of us are particularly "great" -- but this suggests social contagion / information cascade, rather than "we figured this stuff out from the empty string". Maybe if you go back further (e.g. 2012-2016) there was less social contagion, and that cohort is better?
What's the principle here? If an agent would have the same observations in world W and in world W', then it must be indifferent between W and W'? That seems clearly false.
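For concreteness, one way to write the principle being proposed (my formalization, not necessarily what was intended), with $o(W)$ the agent's full observation history in world $W$ and $\sim$ denoting indifference:

$$o(W) = o(W') \implies W \sim W'.$$

An agent that cares whether the people around it are physically real rather than simulated looks like a counterexample: its observation history can be identical in the two worlds while its preferences are not.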