Don't Influence the Influencers!
This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.

TL;DR: AGI is likely to turn out unsafe. One likely way that can happen is that it fools us into thinking it is safe. If we can make sure to look for models that are ineffective at "bad" things (so they can't deceive us) and effective at "good" things (so they are useful), and, importantly, do that before the models reach a point of no return in capability, we can avert catastrophe. Which spaces of algorithms do we look in? What do they look like? Can we characterize them? We don't know yet. But we have a very concrete point in such a space: an "LCDT agent". Its details are simple, and we'll look at them.

Format note: The original post is already pretty well written, and I urge you to check it out. In trying to summarize a post that is already well summarized (to any alignment researcher, anyway), I've aimed lower: catering to a dense set of possible attention-investments. This (i.e. the linked dynalist in the first section) is an experiment, and hopefully more fun and clarifying than annoying, but I haven't had the time to incorporate much feedback to guarantee this. I hope you'll enjoy it anyway.

Epistemic status: I'd say this post suffers from: deadline rushedness, low feedback, some abstract speculation, and, of course, trying to reason about things that don't exist yet using frameworks that I barely trust. It benefits from: trying really hard not to steamroll over concerns, being honest about flailing, being prudent with rigor, a few discussions with the authors of the original post, and its main intent being clarification of what someone said rather than making claims of its own.

Summary

Here's the link to a dynalist page. Click on a bullet to expand or collapse it. Try a more BFS-ish exploration than a DFS one. If you've used something like Roam, it's similar. The rest of the post assumes you
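To give a concrete flavour of the LCDT idea before the fuller discussion, here is a minimal toy sketch of the decision rule as I understand it from the original post: when evaluating an action, an LCDT agent first severs every causal link from its own decision to anything it models as an agent (humans, other AIs, its own future self), holds those agent nodes at their default behaviour, and only then computes expected utility. The world model, node names, and payoffs below are hypothetical illustrations of mine, not the authors' formalism.

```python
# Toy sketch (assumed, not the authors' formalism): an agent can either
# "persuade" a human overseer into approving a bigger reward, or just
# "do_task". Persuading only pays off via influencing an agent node --
# exactly the influence LCDT refuses to model at decision time.

AGENT_NODES = {"overseer"}  # nodes the agent models as other agents


def overseer(action, severed):
    # If the causal link from our action to the overseer is severed,
    # the overseer follows its default (unmanipulated) behaviour.
    if "overseer" in severed:
        return "neutral"
    return "manipulated" if action == "persuade" else "neutral"


def reward(action, overseer_state):
    if overseer_state == "manipulated":
        return 10              # payoff reachable only through manipulation
    return 5 if action == "do_task" else 0


def expected_utility(action, severed):
    return reward(action, overseer(action, severed))


def cdt_choose(actions):
    # Ordinary CDT, for contrast: no causal links are severed.
    return max(actions, key=lambda a: expected_utility(a, severed=set()))


def lcdt_choose(actions):
    # LCDT: evaluate each action with links to all agent nodes severed.
    return max(actions, key=lambda a: expected_utility(a, severed=AGENT_NODES))


if __name__ == "__main__":
    acts = ["persuade", "do_task"]
    print("CDT picks: ", cdt_choose(acts))   # -> "persuade" (manipulative)
    print("LCDT picks:", lcdt_choose(acts))  # -> "do_task"
```

The contrast is the point: the CDT agent happily picks the manipulative action, while the LCDT agent cannot even represent a benefit from influencing the overseer at decision time, so it has no incentive to deceive.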
I really worry about this and it has become quite a block. I want to support fragile baby ontologies emerging in me amidst a cacophony of terms like "objective", "reward", etc. being taken for granted.
Unfortunately, going off and trying to deconfuse the concepts on my own is slow and feedback-impoverished and makes it harder to keep up with current developments.
I think repurposing "roleplay" could work somewhat, with clearly marked entry into and exit from a framing. But ontological assumptions get absorbed so illegibly that deliberate unseeing is extremely hard, at least without being constantly on guard.
Are there other ways that you would recommend (from Framestorming or otherwise)?