Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to AGI risk that fits worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
I've recently been irritated by something that feels vaguely similar with regard to research loops. Walking through this example...
I note that most of those problems have little to do with the printing technology itself, and are mostly about poor organization around that technology.
Friction sources
What common threads can we see here?
I see a few directions we can take it.
In what other situations does this show up? An obvious example is various bureaucratic procedures, which require some research to make sense of, and where a small typo/mistake in your documents might mean having to schedule another appointment next hour/day/month. That is: a procedure you'd plausibly interact with only one/a few times in your life is managed by people who interact with it constantly, and the bits of friction you run into are not obviously catastrophic, so those people are not forced to make things easy for first-time/imperfect/inexperienced users.
Note that the full thing is not necessarily easy to fix, much like e. g. technical debt isn't[1]. The work to keep systems easy-to-interface-with is, as usual, a battle against entropy. Overall, I feel this problem is just another manifestation of that one.
Things could certainly be made much better, though, by both:
And indeed, technical debt is another instance: for programmers maintaining a codebase, it's no one's first priority to make it easy to understand/modify/interface-with for a one-time user.
Potentially the "inconsistent printer behavior" is downstream of that, by the way. Perhaps there's no obvious correct/convergent way to interact with low-level printer software, so each app's designers had to take their own approach, which is why behaviors end up inconsistent across apps.
My point is: if you are a nascent super-villain, don't worry about telling people about your master plan. Almost no one will believe you, so the costs are basically non-existent.
Good try, but you won't catch me that easily!
Humans can't remember that many good passwords
I'd counter that you can in fact memorize ~arbitrarily many good passwords if you:
I'm not actually doing that for all my passwords, only for a dozen to the more important stuff.[2] But I expect this scales pretty well.
Well, technically, this is only approximately injective: the worst case is that adding words to two different random sequences would map them to the same phrase. E. g., "cat jumps moon" and "cat moon shine" could both be padded out into "the cat jumps over the moon to shine", at which point the phrase alone no longer tells you which word-sequence was the password.
But I posit that this would never happen in practice.
I also use KeePass for the rest, sync'd through ProtonDrive. The fewer cloud services you have to trust, the better.
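For concreteness, here's a rough Python sketch of the word-sequence half of this. Everything in it (the wordlist, the word count, the example memory-sentence) is an illustrative stand-in rather than my actual setup:

```python
import secrets

# Tiny illustrative wordlist. A real setup would use a large diceware-style
# list (e.g. the EFF long wordlist: 7776 words, ~12.9 bits of entropy each).
WORDLIST = [
    "apple", "river", "copper", "lantern", "meadow", "static",
    "violet", "harbor", "cinder", "quartz", "nomad", "pixel",
]

def random_word_sequence(n_words: int = 5) -> list[str]:
    """Draw n_words uniformly at random using a CSPRNG (secrets, not random)."""
    return [secrets.choice(WORDLIST) for _ in range(n_words)]

words = random_word_sequence()
password = "-".join(words)  # the string you actually type
print("password:", password)

# Memorization aid (the "add words to the random sequence" step): weave the
# same words, in the same order, into a sentence you invent and never write
# down, e.g. for ["copper", "meadow", "pixel", "river", "nomad"]:
#   "a copper kettle in the meadow shows one pixel of the river to a nomad"
# Recovering the password then means reading the wordlist-words back out of
# the sentence in order.
```

With a full-size wordlist, five words come out to roughly 64 bits of entropy, which is a genuinely strong password; the hard part is purely the memorization, which is what the added-words phrase is for.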
My response here would be similar to this one. I think there's a kind of "bitter lesson" here: for particularly complex fields, it's often easier to solve the general problem of which that field is an instance, rather than attempting to solve the field directly. For example:
Like, yeah, after you've sketched out your general-purpose method and you're looking for where to apply it, you'd need to study the specific details of the application domain and tinker with your method's implementation. But the load-bearing, difficult step is deriving the general-purpose method itself; the last-step fine-tuning is comparatively easy.
In addition, I'm not optimistic about solving e. g. interpretability directly, simply because there's already a whole field of people trying to do that, with fairly leisurely progress. On the intelligence-enhancement front, there would be mountains of regulatory red tape to get through, and the experimental loops would be rate-limited by slow human biology. Etc., etc.
Right, I probably should've expected this objection and pre-addressed it more thoroughly.
I think this is a bit of "missing the forest for the trees". In my view, every single human concept and every single human train of thought is an example of human world-models' autosymbolicity. What are "cats", "trees", and "grammar", if not learned variables from our world-model that we can retrieve, understand the semantics of, and flexibly use for the purposes of reasoning/problem-solving?
We don't have full self-interpretability by default, yes. We have to reverse-engineer our intuitions and instincts (e. g., grammar from your example), and for most concepts, we can't break their definitions down into basic mathematical operations. But in modern adult humans, there is a vast interpreted structure that contains an enormous amount of knowledge about the world, corresponding to, well, literally everything a human consciously knows. Which, importantly, includes every fruit of our science and technology.
If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.
Edited for clarity.
I'm curious: what's your estimate of how many resources it'd take to drive the risk down to 25%, 10%, 1%?
I can think of plenty of reasons, of varying levels of sensibility.
Arguments
tl;dr:
E. g., Ryan Greenblatt thinks that spending just 5% more resources than is myopically commercially expedient would drive the risk down to 50%. AI 2027 also assumes something like this.
E. g., I think this is the position of Leopold Aschenbrenner.
Explanation
(The post describes a fallacy where you rule out a few specific members of a set using properties specific to those members, and proceed to conclude that you've ruled out that entire set, having failed to consider that it may have other members which don't share those properties. My comment takes specific examples of people falling into this fallacy that happened to be mentioned in the post, rules out that those specific examples apply to me, and proceeds to conclude that I'm invulnerable to this whole fallacy, thus committing this fallacy.
(Unless your comment was intended to communicate "I think your joke sucks", which, valid.))
Exhaustive Free Association is a step in a chain of reasoning where the logic goes "It's not A, it's not B, it's not C, it's not D, and I can't think of any more things it could be!"
Oh no, I wonder if I ever made that mistake.
Security Mindset
Hmm, no, I think I understand that point pretty well...
They listed out the main ways in which an AI could kill everyone (pandemic, nuclear war, chemical weapons) and decided none of those would be particularly likely to work
Definitely not it; I have a whole rant about it. (Come to think of it, that rant also covers the security-mindset thing.)
They perform an EFA to decide which traits to look for, and then they perform an EFA over different "theories of consciousness" in order to try and calculate the relative welfare ranges of different animals.
I don't think I ever published any EFAs, so I should be in the clear here.
The Fatima Sun Miracle
Oh, I'm not even religious.
Phew! I was pretty worried there for a moment, but no, looks like I know to avoid that fallacy.
I was hoping someone would go ahead and try this. Great work, love it.
Hm, I think that specific argument falls apart in that case. Suppose humans indeed like BFWHASF more than ice cream, but mostly eat the latter due to practical constraints. That means that, once we become more powerful and those constraints fall away, we would switch over to BFWHASF. But that's actually the regime in which we're not supposed to be able to predict what an agent would do!
As the argument goes, in a constrained environment with limited options (such as when a relatively stupid AGI is still trapped in our civilization's fabric), an agent might appear to pursue values it was naively shaped to have. But once it grows more powerful, and gets the ability to take what it really wants, it would go for some weird unpredictable edge-instantiation thing. The suggested "BFWHASF vs. ice cream" case would actually be the inverse of that.
The core argument should still hold:
But I do think BFWHASF doesn't end up as a very good illustration of this point, if humans indeed like it a lot.
Just tune the implementation details of all of this such that the pipeline still meets those people's aesthetic preferences for "natural-ness". E. g., note that people who want "real meat" today are perfectly fine with eating the meat of selectively bred beasts.