Off the top of my head, not very well-structured:
It seems the core thing we want our models to handle here is the concept of approximation errors, no? The "horse" symbol has mutual information with the approximation of a horse; the Santa Claus feature existing corresponds to something approximately like Santa Claus existing. The approach of "features are chiseled into the model/mind by an imperfect optimization process to fulfil specific functions" is then one way to start tackling this approximation problem. But it kind of just punts all the difficult parts onto "what the optimization landscape looks like".
Namely: the needed notion of approximation is pretty tricky to define. What are the labels of the dimensions of the space in which errors are made? What is the "topological picture" of these errors?
We'd usually formalize it as something like "this feature activates on all images within MSE distance of horse-containing images". And indeed, that seems to work well for the "horse vs cow-at-night" confusion.
But consider Santa Claus. That feature "denotes" a physical entity. Yet, what it {responds to}/{is formed because of} are not actual physical entities that are approximately similar to Santa Claus, or look like Santa Claus. Rather, it's a sociocultural phenomenon, which produces sociocultural messaging patterns that are pretty similar to sociocultural messaging patterns which would've been generated if Santa Claus existed[1].
If we consider a child fooled into believing in Santa Claus, what actually happened there is something like: the child learned a deeply incomplete model of how sociocultural messaging gets generated, within which the Santa-themed messaging they observe maps onto "Santa Claus exists".
Going further, consider ghosts. Imagine ghost hunters equipped with a bunch of paranormal-investigation tools. They do some investigating and conclude that their readings are consistent with "there's a ghost". The issue isn't merely that there's such a small distance between "there's a ghost" and "there's no ghost" tool-output-vectors that the former fall within the approximation error of the latter. The issue is that the ghost hunters learned a completely incorrect model in which some tool-outputs which don't, in reality, correspond to ghosts existing, are mapped to ghosts existing.
Which, in turn, presumably happened because they'd previously confused the sociocultural messaging pattern of "tons of people are fooled into thinking these tools work" with "these tools work".
Which sheds some further light on the Santa Claus example too. Our sociocultural messaging about Santa Claus is not actually similar to the messaging in the counterfactual where Santa Claus really existed[2]. It's only similar in children's deeply incomplete models of how those messaging patterns work...
Summing up, I think a merely correlational definition can still be made to work, as long as you:
... Or something like that.
Another idea I've been thinking about:
Consider the advantage prediction markets have over traditional news. If I want to keep track of some variable X, such as "the amount of investment going into Stargate", and all I have is traditional news, I have to constantly, manually sift through news reports in search of relevant information. With prediction markets, however, I can just bookmark this page and check it periodically.
An issue with prediction markets is that they're not well-organized. You have the tag system, but you don't know which outcomes feed into other events, you don't necessarily know what prompts specific market updates (unless someone mentions that in the comments), you don't have a high-level outline of the ontology of a given domain, etc. Traditional news reports offer some of that, at least: if competently written and truthful, they offer causal models and narratives behind the events.
It would be nice if we could fuse the two: an interface for engaging with the news that combines the conciseness of prediction-market updates with the model-based understanding traditional news attempts to offer.
One obvious idea is to arrange it into the form of a Bayes net. People (perhaps the site's managers, perhaps anyone) could set up "causal models", in which specific variables are downstream of other variables. Other people (forecasters/experts hired by the project's managers, or anyone, like in prediction markets) could bet on which models are true[1], and within the models, on the values of specific variables[2]. (Relevant.)
Among other things, this would ensure built-in "consistency checks". If, within a given model, a variable X is downstream of outcomes A, B, C, such that X happens if and only if all of A, B, and C happen, but the market-estimated P(X) doesn't match P(A ∧ B ∧ C), this would suggest either that the prediction markets are screwing up, or that there's something wrong with the given model.
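To make the consistency-check idea concrete, here's a minimal sketch; all function names and market prices are made up. It assumes we only have the marginal prices for A, B, C (no joint market), so the most it can enforce are the Fréchet bounds on P(A ∧ B ∧ C):

```python
# Minimal sketch of a "consistency check" between a market on X and markets on its
# parents A, B, C, under the model "X happens iff A, B, and C all happen".
# With only marginal prices and no joint market, we can only enforce the Frechet
# bounds on P(A and B and C). Prices and names are made up for illustration.

def frechet_bounds(marginals: list[float]) -> tuple[float, float]:
    """Bounds on P(intersection of events) given only their marginal probabilities."""
    lower = max(0.0, sum(marginals) - (len(marginals) - 1))
    upper = min(marginals)
    return lower, upper

def check_model(p_x: float, parent_marginals: list[float], tol: float = 0.02) -> str:
    lower, upper = frechet_bounds(parent_marginals)
    if p_x < lower - tol or p_x > upper + tol:
        return (f"Inconsistent: P(X)={p_x:.2f} lies outside the feasible range "
                f"[{lower:.2f}, {upper:.2f}] implied by the parent markets: "
                "either the markets are mispriced or the causal model is wrong.")
    return "Consistent (given only marginal prices, no contradiction detected)."

# Hypothetical market prices:
print(check_model(p_x=0.70, parent_marginals=[0.9, 0.8, 0.6]))  # above the 0.6 upper bound: flagged
print(check_model(p_x=0.45, parent_marginals=[0.9, 0.8, 0.6]))  # within [0.3, 0.6]: fine
```

A real implementation over a full Bayes net would propagate much tighter constraints, but even this crude version would catch gross mispricings or badly wrong models.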
Furthermore, one way for this to gain prominence/mainstream appeal is if specific high-status people or institutions set up their own "official" causal models. For example, an official AI 2027 causal model, or an official MIRI model of AI doom which avoids the multiple-stage fallacy and clearly shows how it's convergent.
Tons of ways this might not work out, but I think it's an interesting idea to try. (Though maybe it's something that should be lobbed over to Manifold Markets' leadership.)
"can we test out a version of this sort of thing powered by some humans-in-a-trenchcoat"
Response lag would be an issue here. As you'd pointed out, to be a proper part of the "exobrain", tools need to have very fast feedback loops. LLMs can plausibly do the needed inferences quickly enough (or perhaps not, that's a possible failure mode), but if there's a bunch of humans on the other end, I expect it'd make the tools too slow to be useful, providing little evidence regarding faster versions.
(I guess it'd work if we put von Neumann on the other end or something, someone able to effortlessly do mountainous computations in their head, but I don't think we have many of those available.)
or otherwise somehow test the ultimate hypothesis without having to build the thing
I think the minimal viable product here would be relatively easy to build. It'd probably just look like a LaTeX-supporting interface where you can define a bunch of expressions, type natural-language commands into it ("make this substitution and update all expressions", "try applying method #331 to solving this equation"), and in the background an LLM with tool access uses its heuristics plus something like SymPy to execute them, then updates the expressions.
The core contribution here would be removing LLM babble from the equation, abstracting the LLM into the background so that you can interact purely with the math. Claude's Artifacts functionality and ChatGPT's Canvas + o3 can already be more or less hacked into this (though there are some issues, such as them screwing up LaTeX formatting).
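For illustration, here's a minimal sketch of the symbolic backbone such an MVP could sit on top of, assuming SymPy. The LLM's only job would be translating commands like "make this substitution and update all expressions" into calls like the made-up `substitute_everywhere` below:

```python
# A sketch of the non-LLM backbone: the user's state is a workspace of named
# expressions, and commands are plain SymPy operations applied across it.
import sympy as sp

x, u = sp.symbols('x u')

# The user's current workspace of named expressions (contents are arbitrary examples).
workspace = {
    "E1": sp.sin(x**2) + sp.cos(x**2)**2,
    "E2": sp.exp(x**2) * (1 + x**2),
}

def substitute_everywhere(ws, old, new):
    """Apply a substitution to every expression and return the updated workspace."""
    return {name: expr.subs(old, new) for name, expr in ws.items()}

def simplify_all(ws):
    return {name: sp.simplify(expr) for name, expr in ws.items()}

# E.g., the natural-language command "substitute u = x^2 everywhere" becomes:
workspace = simplify_all(substitute_everywhere(workspace, x**2, u))
for name, expr in workspace.items():
    print(name, "=", sp.latex(expr))
```

The point is that the state lives in the workspace rather than in a chat transcript, so the LLM's prose never has to reach the user.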
"Automatic search for refactors of the setup which simplify it" should also be relatively easy. Just the above setup, a Loom-like generator of trees of thought, and a side window where the summaries of the successful branches are displayed.
Also: perhaps an unreliable demo of the full thing would still be illustrative? That is, hack together some interface that lets you flexibly edit and flip between math representations, maybe powered by some extant engine for that sort of thing (e. g., 3Blue1Brown's Manim? there are probably better fits). Don't bother with fine-tuning the LLMs, with wrapping them in proof-checkers, and with otherwise ensuring they don't make errors. Give the tool to some researchers to play with, see if they're excited about a reliable version.
"making it easier to change representations will enable useful thinking in hmath"
Approaching it from a different direction, how much evidence do we already have for this hypothesis?
Overall, I think that if something like this tool could be built and made to work reliably, the case for it being helpful is pretty solid. (Indeed, if I were more confident that AGI is 5+ years away, making object-level progress on alignment less of a priority, I'd try building it myself.) The key question here is whether it can actually be made to work flexibly/reliably enough on the back of the current LLMs.
On which point, as far as the implementation side goes, the core places where it might fail are:
Perhaps we could set it up so that, e. g., the first time you instantiate the connection between two representations, the task is handed off to a big LLM, which infers a bunch of rules and writes a bunch of code snippets regarding how to manage the connection, and the subsequent calls are forwarded to smaller, faster LLMs with a bunch of context provided by the big LLM to assist them. But again: would that work? Is there a way to frontload the work like this? Would smaller LLMs be up for the task?
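As a sketch of what that frontloading could look like structurally (the `call_big_llm`/`call_small_llm` functions are hypothetical stand-ins, not any real API):

```python
# Frontloading pattern: the big model derives a reusable "connection spec" the first
# time a representation pair is seen; later calls go to a cheaper model with that
# spec provided as context.

connection_specs: dict[tuple[str, str], str] = {}  # cache keyed by representation pair

def call_big_llm(prompt: str) -> str: ...    # stand-in: expensive, slow, better at open-ended inference
def call_small_llm(prompt: str) -> str: ...  # stand-in: cheap, fast, needs the rules spelled out

def translate(expr: str, src: str, dst: str) -> str:
    key = (src, dst)
    if key not in connection_specs:
        # First use of this representation pair: have the big model write down the rules once.
        connection_specs[key] = call_big_llm(
            f"Write explicit rules and helper code for translating {src} "
            f"representations into {dst} representations."
        )
    # Subsequent calls: the small model just applies the frontloaded spec.
    return call_small_llm(
        f"Using these rules:\n{connection_specs[key]}\n"
        f"Translate the following {src} object into {dst}:\n{expr}"
    )
```

Whether the small models could actually execute such specs reliably is exactly the open question.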
while I think most people aren't quite tackling this with the particular taste I'd apply, it does sure seem like everyone is working on "do stuff with LLMs" and it's not where the underpicked fruit is
I disagree, I think pretty much nobody is attempting anything useful with LLM-based interfaces. Almost all projects I've seen in the wild are terrible and there are tons of unpicked low-hanging fruits.
I'd been thinking, on and off, about ways to speed up agent-foundations research using LLMs. An LLM-powered exploratory medium for mathematics is one possibility.
A big part of highly theoretical research is flipping between different representations of the problem: viewing it in terms of information theory, in terms of Bayesian probability, in terms of linear algebra; jumping from algebraic expressions to the visualizations of functions or to the nodes-and-edges graphs of the interactions between variables; et cetera.
The key reason behind it is that research heuristics bind to representations. E. g., suppose you're staring at some graph-theory problem. Certain problems of this type are isomorphic to linear-algebra problems, and they may be trivial in linear-algebra terms. But unless you actually project the problem into the linear-algebra ontology, you're not necessarily going to see the trivial solution when staring at the graph-theory representation. (Perhaps the obvious solution is to find the eigenvectors of the adjacency matrix of the graph – but when you're staring at a bunch of nodes connected by edges, that idea isn't obvious in that representation at all.)
This is a bit of a simplified example – the graph theory/linear algebra connection is well-known, so experienced mathematicians may be able to translate between those representations instinctively – but I hope it's illustrative.[1]
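For a concrete, deliberately toy instance of this kind of projection (a stand-in, not the specific eigenvector trick gestured at above): counting triangles in a graph is awkward to read off the nodes-and-edges picture, but in the adjacency-matrix representation it's just trace(A³)/6, equivalently a statement about the spectrum.

```python
# Triangle counting as a "project into linear algebra" move.
import numpy as np

# Adjacency matrix of a small undirected graph: a 4-cycle 0-1-2-3 plus the chord 0-2.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

triangles_via_powers = np.trace(np.linalg.matrix_power(A, 3)) / 6
eigenvalues = np.linalg.eigvalsh(A)              # symmetric matrix, so the spectrum is real
triangles_via_spectrum = np.sum(eigenvalues**3) / 6

print(triangles_via_powers, triangles_via_spectrum)  # both should be ~2.0: the triangles {0,1,2} and {0,2,3}
```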
As a different concrete example, consider John Wentworth's Bayes Net Algebra. This is essentially an interface for working with factorizations of joint probability distributions. The nodes-and-edges representation is more intuitive and easier to tinker with than the "formulas" representation, which means that having concrete rules for tinkering with graph representations without committing errors would significantly speed up reasoning through related math problems. Imagine if the derivation of such frameworks were automated: if you could set up a joint PD in terms of formulas, automatically project the setup into graph terms, start tinkering with it by dragging nodes and edges around, and get errors if and only if back-projecting the changed "graph" representation into the "formulas" representation results in a setup that's non-isomorphic to the initial one.
(See also this video, and the article linked above.)
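Here's a brute-force toy version of that back-projection check (not Wentworth's actual rules): over three binary variables with made-up numbers, delete an edge from the graph, refit the edited factorization to the original joint, and flag an error iff it no longer reproduces the distribution.

```python
# Toy back-projection check: original factorization P(A) P(B|A) P(C|B); proposed graph
# edit deletes the edge B -> C, i.e. claims the factorization P(A) P(B|A) P(C).
from itertools import product

p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # keys: (b, a)
p_c_given_b = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.9}  # keys: (c, b)

joint = {
    (a, b, c): p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
    for a, b, c in product((0, 1), repeat=3)
}

def marginal(dist, keep):
    """Marginalize the joint onto the variable indices in `keep`."""
    out = {}
    for assignment, p in dist.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Back-project the edited graph: refit P(C) from the joint and rebuild the distribution.
p_c = marginal(joint, keep=(2,))
refit = {
    (a, b, c): p_a[a] * p_b_given_a[(b, a)] * p_c[(c,)]
    for a, b, c in product((0, 1), repeat=3)
}

max_error = max(abs(joint[k] - refit[k]) for k in joint)
print("edit is valid" if max_error < 1e-9 else f"edit rejected (max deviation {max_error:.3f})")
```

Here the edit gets rejected, since C genuinely depends on B in the original joint; dropping an edge that carried no dependence would pass.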
A related challenge is refactors. E. g., suppose you're staring at some complicated algebraic expression with an infinite sum. It may be the case that a certain no-loss-of-generality change of variables would easily collapse that expression into a Fourier series, or make some Obscure Theorem #418152/Weird Trick #3475 trivially applicable. But unless you happen to be looking at the problem through that lens, you're not going to be able to spot it. (Especially if you don't know Obscure Theorem #418152/Weird Trick #3475.)
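As a deliberately trivial stand-in for this kind of collapse (far simpler than the Fourier-series case): the substitution t = e^{-x} turns an infinite sum into a plain geometric series, which then evaluates in closed form.

```latex
% Requires amsmath. A toy example of a change of variables collapsing a sum:
\[
  \sum_{n=1}^{\infty} e^{-n x}
  \;\xrightarrow{\; t \,=\, e^{-x} \;}\;
  \sum_{n=1}^{\infty} t^{n}
  \;=\; \frac{t}{1 - t}
  \;=\; \frac{1}{e^{x} - 1},
  \qquad x > 0 .
\]
```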
It's plausible that these two tasks are what 90% of math research (the "normal-science" part of it) consists of, in terms of time expenditure: flipping between representations in search of a representation-chain where every step is trivial.
Those problems would be ameliorated by (1) reducing the friction costs of flipping between representations, and (2) being able to set up automated searches for simplifying refactors of the problem.
Can LLMs help with (1)? Maybe. They can write code and they can, more or less, reason mathematically, as long as you're not asking them for anything creative. One issue is that they're also really sloppy and deceptive when writing proofs... But that problem can potentially be ameliorated by fine-tuning e. g. r1 to justify all its conclusions using rigorous Lean code, which could be passed to automated proof-checkers before being shown to you.[2]
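As a toy illustration of what the checker would be enforcing (Lean 4; theorem names are made up): the first proof compiles cleanly, while the second leans on the `sorry` placeholder and is exactly the kind of output the wrapper should reject rather than show to the user.

```lean
-- Compiles cleanly: a real proof term.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Compiles only with a "declaration uses 'sorry'" warning; the wrapper should reject it.
theorem unjustified_claim (a b : Nat) : a * b = b * a := by
  sorry
```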
Can LLMs help with (2)? Maybe. I'm thinking something like the Pantheon interface, where you're working through the problem on your own, and in a side window LLMs offer random ideas regarding how to simplify the problem.
LLMs have bad research taste, which would extend to figuring out what refactorings they should try. But they also have a superhuman breadth of knowledge regarding theorems/math results. A depth-first search might thus be productive here. Most of the LLMs' suggestions would be trash, but as long as complete nonsense is screened off by proof-checkers, and the ideas are represented in a quickly-checkable manner (e. g., equipped with one-sentence summaries), and we're giving LLMs an open-ended task, some results may be useful.
I expect I'd pay $200-$500/month for a working, competently executed tool of this form; even more the more flexible it is. I expect plenty of research mathematicians (not only agent-foundations folks) would, as well. There's a lucrative startup opportunity there.
@johnswentworth, any thoughts?
A more realistic example would concern ansatzes, i. e., various "weird tricks" for working through problems. They likewise bind to representations, such that the idea of using one would only occur to you if you're staring at a specific representation of the problem, and would fail to occur if you're staring at an isomorphic-but-shallowly-different representation.
Or using o3 with a system prompt where you yell at it a lot to produce rigorous Lean code, with a proof-checker that returns errors if it ever uses a placeholder always-passes "sorry" expression. But I don't know whether you can yell at it loudly enough using just the system prompt, and this latest generation of LLMs seems really into Goodharting, so it might straight-up try to exploit bugs in your proof-checker.
I'd guess this paper doesn't have the actual optimal methods.
Intuitively, this shouldn't matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods'. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they'd elicit pass@800 capabilities instead of "just" pass@400, but it'd still be just pass@k elicitation for not-astronomical k.
Not strongly convinced of that, though.
Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1]
I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case.
Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of the Singularity-enabling technique that RL-on-CoTs is touted as.
Since the US government is expected to treat other stakeholders in its previous block better than China treats members of its block
At the risk of getting too into politics...
IMO, this was maybe-true for the previous administrations, but is completely false for the current one. All people making the argument based on something like this reasoning need to update.
Previous administrations were more or less dead inertial bureaucracies. Those actually might have carried on acting in democracy-ish ways even when facing outside-context events/situations, such as suddenly having access to overwhelming ASI power. Not necessarily because they were particularly "nice", as such, but because they weren't agenty enough to do something too out-of-character compared to their previous democracy-LARP behavior.
I still wouldn't have bet on them acting in pro-humanity ways (I would've expected some more agenty/power-hungry governmental subsystem to grab the power, circumventing e. g. the inertial low-agency Presidential administration). But there was at least a reasonable story there.
The current administration seems much more agenty: much more willing to push the boundaries of what's allowed and deliberately erode the constraints on what it can do. I think it doesn't generalize to boring democracy-ish behavior out-of-distribution, I think it eagerly grabs and exploits the overwhelming power. It's already chomping at the bit to do so.
Mm, yeah, maybe. The key part here is, as usual, "who is implementing this plan"? Specifically, even if someone solves the preference-agglomeration problem (which may be possible to do for a small group of researchers), why would we expect it to end up implemented at scale? There are tons of great-on-paper governance ideas which governments around the world are busy ignoring.
For things like superbabies (or brain-computer interfaces, or uploads), there's at least a more plausible pathway to wide adoption: the same profit-maximization/geopolitical-power motives as with AGI.
I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
I don't see it.
The distribution of power post-ASI depends on the constraint/goal structures instilled into the (presumed-aligned) ASI. That means the people in whose hands all power is concentrated are the ones deciding what goals/constraints to instill into the ASI, in the time prior to the ASI's existence. What people could those be?
Fundamentally, the problem is that there's currently no faithful mechanism of human preference agglomeration that works at scale. That means both that (1) it's currently impossible to let humanity-as-a-whole actually weigh in on the process, and (2) there are no extant outputs of such a mechanism around: none of the people and systems that currently hold power are aligned to humanity in a way that generalizes to out-of-distribution events (such as being given godlike power).
Thus, I could only see three options:
A group of humans that compromises on making the ASI loyal to humanity is likely more realistic than a group of humans which is actually loyal to humanity. E. g., because the group has some psychopaths and some idealists, and all psychopaths have to individually LARP being prosocial in order to not end up with the idealists ganging up against them, with this LARP then being carried far enough to end up in the ASI's goals. But this still involves that small group having ultimate power; still involves the future being determined by how the dynamics within that small group shake out.
Rather than keeping him in the dark or playing him, which reduces to Scenario 1.
Hm. Galaxy-brained idea for how to use this as a springboard to make prediction markets go mainstream:
(Note that it follows the standard advice for startup growth, where you start in a very niche market, gradually eat it all, then expand beyond this market, iterating until your reach is all-pervading.)