Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to the AGI risk fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
I can think of plenty of reasons, of varying levels of sensibility.
Arguments
tl;dr:
E. g., Ryan Greenblatt thinks that spending just 5% more resources than is myopically commercially expedient would drive the risk down to 50%. AI 2027 also assumes something like this.
E. g., I think this is the position of Leopold Aschenbrenner.
Explanation
(The post describes a fallacy where you rule out a few specific members of a set using properties specific to those members, and proceed to conclude that you've ruled out that entire set, having failed to consider that it may have other members which don't share those properties. My comment takes specific examples of people falling into this fallacy that happened to be mentioned in the post, rules out that those specific examples apply to me, and proceeds to conclude that I'm invulnerable to this whole fallacy, thus committing this fallacy.
(Unless your comment was intended to communicate "I think your joke sucks", which, valid.))
Exhaustive Free Association is a step in a chain of reasoning where the logic goes "It's not A, it's not B, it's not C, it's not D, and I can't think of any more things it could be!"
Oh no, I wonder if I ever made that mistake.
Security Mindset
Hmm, no, I think I understand that point pretty well...
They listed out the main ways in which an AI could kill everyone (pandemic, nuclear war, chemical weapons) and decided none of those would be particularly likely to work
Definitely not it, I have a whole rant about it. (Come to think of it, that rant also covers the security-mindset thing.)
They perform an EFA to decide which traits to look for, and then they perform an EFA over different "theories of consciousness" in order to try and calculate the relative welfare ranges of different animals.
I don't think I ever published any EFAs, so I should be in the clear here.
The Fatima Sun Miracle
Oh, I'm not even religious.
Phew! I was pretty worried there for a moment, but no, looks like I know to avoid that fallacy.
Oh, thanks!
Another minor QoL improvement is the right-click behavior in the editor
That's really useful, thanks!
Any chance you can make a way to easily insert a horizontal-line element into comments? Perhaps add the button to the "show more items" submenu? People (me) sometimes write long-form comments with several different sections, and as-is, you have to go into your post-drafts folder and copy-paste it from there.
Some new data on that point:
Maybe if lots of noise is constantly being injected into the universe, this would change things. Because then the noise counts as part of the initial conditions. So the K-complexity of the universe-history is large, but high-level structure is common anyway because it's more robust to that noise?
To summarize what the paper argues (from my post in that thread):
- Suppose the microstate of a system is defined by a (set of) infinite-precision real numbers, corresponding to e. g. its coordinates in phase space.
- We define the coarse-graining as a truncation of those real numbers: i. e., we fix some degree of precision.
- That degree of precision could be, for example, the Planck length.
- At the microstate level, the laws of physics may be deterministic and reversible.
- At the macrostate level, the laws of physics are stochastic and irreversible. We define them as a Markov process, with transition probabilities given by "the fraction of the microstates in the current macrostate that map into the target macrostate at the next moment".
- Over time, our ability to predict what state the system is in from our knowledge of its initial coarse-grained state + the laws of physics degrades.
- Macroscopically, it's because of the properties of the specific stochastic dynamics we're forced to use (this is what most of the paper is proving, I think).
- Microscopically, it's because ever-more-distant decimal digits in the definition of the initial state start influencing the dynamics ever more strongly. (See the multibaker map in Appendix A, the idea of "microscopic mixing" in a footnote, and also apparently Kolmogorov-Sinai entropy.)
- That is: in order to better pinpoint farther-in-time states, we would have to spend more bits (either by defining more fine-grained macrostates, or maybe by locating them in the execution trace).
- Thus: stochasticity, and the second law, are downstream of the fact that we cannot define the initial state with infinite precision.
I. e., it is effectively the case that there's (pseudo)randomness injected into the state-transition process.
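To make that concrete for myself, here's a minimal toy in Python (my own sketch, not the paper's construction; the map choice and all names are mine): the microstate is a real in [0, 1), the microdynamics are the chaotic doubling map x → 2x mod 1, and the coarse-graining truncates to the leading few binary digits. Counting where sampled microstates land recovers a stochastic macrodynamics, and coarse-grained predictability decays step by step as ever-deeper digits of the initial condition get promoted to coarse-grained relevance:

```python
import random

random.seed(0)

K_BITS = 10           # coarse-graining precision: keep the leading K_BITS binary digits
N_CELLS = 2 ** K_BITS

def step(x):
    """Microstate dynamics: the doubling map x -> 2x mod 1 (deterministic, chaotic)."""
    return (2.0 * x) % 1.0

def cell(x):
    """Coarse-graining: truncate x to K_BITS bits, i.e. report which cell it falls in."""
    return int(x * N_CELLS)

# Macrostate transition probabilities, estimated as "the fraction of sampled microstates
# in a given cell that map into each cell at the next moment". Each cell splits its mass
# roughly 50/50 between two successor cells: deterministic microdynamics, stochastic
# macrodynamics.
outgoing = {}
for x in (random.random() for _ in range(100_000)):
    if cell(x) == 0:
        nxt = cell(step(x))
        outgoing[nxt] = outgoing.get(nxt, 0) + 1
total = sum(outgoing.values())
print("out of cell 0:", {c: round(n / total, 2) for c, n in outgoing.items()})

# Predictability decay: take two microstates that share the same initial cell (i.e. the
# same coarse-grained initial condition) but differ in their deeper digits. With each
# step, a deeper digit of the initial condition moves up to the coarse-grained level,
# so the coarse-grained trajectories decorrelate (agreement roughly halves per step).
for t in range(0, 12, 2):
    trials, agree = 2000, 0
    for _ in range(trials):
        x = random.random()
        y = (cell(x) + random.random()) / N_CELLS   # same cell as x, different deep digits
        for _ in range(t):
            x, y = step(x), step(y)
        agree += (cell(x) == cell(y))
    print(f"t={t:2d}  coarse-grained agreement: {agree / trials:.3f}")
```

The agreement halving per step is the toy version of this map's Kolmogorov-Sinai entropy of 1 bit per step: that's the rate at which the coarse-graining "uses up" digits of the initial condition.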
And if a given state has some other regularities by which it could be compactly defined, aside from defining it through the initial conditions, that would indeed decrease its description length/algorithmic entropy. So we again recover the "trajectories that abstract well throughout their entire history are simpler" claim.
I've been thinking about it in terms of "but which language are we using to compute the complexity of our universe/laws of physics?". Usually I likewise just go "it only matters up to an additive constant; just assume we're not using a Turing tarpit and we're probably good". If we do dig into it, though, what can we conclude?
Some thoughts:
What is the "objectively correct" reference language?
We should, of course, assume that the algorithm computing our universe is simple to describe in terms of the "natural" reference language, due to the simplicity prior. I. e., the language should have support for the basic functions our universe's physics computes. I think that's already equivalent to "the machine can run our physics without an insane implementation size".
On the flip side, it's allowed to lack support for functions our universe can't cheaply compute. For example, it may not have primitive functions for solving NP-complete problems. (In theory, I think there was nothing stopping physics from having fundamental particles that absorb Traveling Salesman problems and near-instantly emit their solutions.)
Now suppose we also assume that our observations are sampled from the distribution over all observers in Tegmark 4. This means that when we're talking about the language/TM underlying it, we're talking about some "natural", "objective" reference language.
What can we infer about it?
First, as mentioned, we should assume the reference language is not a Turing tarpit. After all, if we allowed reality to "think" in terms of some arbitrarily convoluted Turing-tarpit language, we could arbitrarily skew the simplicity prior.
But what is a "Turing tarpit" in that "global"/"objective" sense, not defined relative to some applications/programs? Intuitively, it feels like "one of the normal, sane languages that could easily implement all the other sane languages" should be possible to somehow formalize...
Which is to say: when we're talking about the Kolmogorov complexity of some algorithm, in what language are we measuring it? Intuitively, we want to, in turn, pick one of the "simplest" languages to define.[1] But what language do we pick for measuring this language's complexity? An infinite recursion follows.
Intuitively, there's perhaps some way to short-circuit that recursion. (Perhaps by defining the complexity of a language by weighing its complexity across "all" languages, while prioritizing the opinions of those languages which are themselves simple in terms of whatever complexity measure this expression defines? Or something along those lines; circular definitions aren't always a problem. (Though see an essay Tsvi linked to which breaks down why many such definitions don't work.))
Regardless, if something like this is successful, we'll get a "global" definition of what counts as a simple/natural language. This would, in turn, allow us to estimate the "objective" complexity of various problems, by measuring the length of their solutions in terms of that natural language (i. e., the length of the execution trace of a computation solving the problem). This would perhaps show that some problems are "objectively" hard, such as some theoretical/philosophical problems or the NP-complete problems.
The speed prior
What if we try to compute the complexity not of the laws of physics, but of a given observer-moment/universe-state, and penalize the higher-complexity ones?
In chaotic systems, this actually works out to the speed prior: i. e., to assuming that the later steps of a program have less realityfluid than the early ones. Two lines of reasoning:
Anthropically, this means that the computations implementing us are (relatively) stable, and produce "interesting" states (relatively) quickly/in few steps.
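For a very rough sense of the penalty I have in mind, here's some toy bookkeeping of my own (continuing the doubling-map sketch above; LAWS_BITS is a made-up placeholder, and the per-step cost of one bit is specific to that map): pinpointing the coarse-grained state at step t through the initial conditions costs extra bits of the initial condition that grow with t, so exponentiating the description length gives later observer-moments less weight, i.e. a speed-prior-flavored discount (here an aggressive 2^-t one, since the map churns through a full bit per step).

```python
# Toy description-length accounting for the doubling-map toy above (my own
# bookkeeping, not a derivation from the paper). LAWS_BITS is a made-up placeholder.

LAWS_BITS = 50   # hypothetical description length of the dynamics, in bits
K_BITS = 10      # coarse-graining precision, as before

def state_description_bits(t):
    """Bits to pinpoint the coarse-grained state at step t via the initial condition:
    the laws, plus the initial cell, plus ~1 extra bit of initial condition per step
    (the chaotic dynamics promote one deeper digit to coarse-grained relevance each step)."""
    return LAWS_BITS + K_BITS + t

def realityfluid_weight(t):
    """Un-normalized 2^-(description length) weight of the observer-moment at step t."""
    return 2.0 ** -state_description_bits(t)

for t in (0, 1, 10, 100):
    print(t, state_description_bits(t), realityfluid_weight(t))
```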
Anyway, digging into the paper now...
1. Oh, I see it's likewise concerned with the description length of states:
Gács [23] defines the coarse-grained algorithmic entropy of any individual state: roughly speaking, it is the number of bits of information that a fixed computer needs in order to identify the state’s coarse-grained cell. For example, a state in which all particles are concentrated in one location would have low entropy, because the repeated coordinates can be printed by a short program. If the coarse graining in question is Markovian, then Levin’s [24] law of randomness conservation says that the algorithmic entropy seldom decreases. In physical terms, we will come to see this as a vast generalization of the second law of thermodynamics
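A crude way to poke at that first example (my own toy, not Gács' construction): compressed length is a computable upper bound on description length up to a constant, so we can use zlib as a stand-in for "the number of bits a fixed computer needs in order to identify the state's coarse-grained cell". A state with all particles at one location compresses far better than a scattered one.

```python
import random
import zlib

random.seed(0)

N_PARTICLES = 1000
GRID = 256            # coarse-graining: particle positions truncated to one of 256 cells

def coarse_grained_description(cells):
    """Serialize the coarse-grained cell of the state: one grid coordinate per particle."""
    return bytes(cells)

# "All particles concentrated in one location": the repeated coordinate compresses away.
concentrated = coarse_grained_description([42] * N_PARTICLES)
# A generic scattered state: at this precision, the coordinates are incompressible noise.
scattered = coarse_grained_description([random.randrange(GRID) for _ in range(N_PARTICLES)])

print("concentrated:", len(zlib.compress(concentrated)), "bytes")  # tiny
print("scattered:   ", len(zlib.compress(scattered)), "bytes")     # ~N_PARTICLES bytes
```

(Compressed length only upper-bounds the algorithmic entropy, up to a constant, but it's enough to show the asymmetry the quote is pointing at.)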
2. The way the paper justifies the second law of thermodynamics is neat.
My understanding of that
3. The part about incomputability being necessary is also interesting, metaphysically.
Why must it be impossible to prove lower bounds on Kolmogorov complexity?
So, Kolmogorov complexity is upper-semicomputable. This means that, for some computable function $f$: $K(x) = \lim_{t \to \infty} f(x, t)$, with $f(x, t)$ non-increasing in $t$. I. e., we can computably approach $K(x)$ from above, but we can never certify a nontrivial lower bound on it.
Imagine if it were otherwise: if some program $p$ much smaller than $K(x)$ could prove a lower bound on $K(x)$. Then you could use that to cheaply pinpoint $x$: by setting up a program that goes through programs in order, uses $p$ to estimate the lower bound on their $K$, then outputs the first program whose proven complexity is above a threshold $n$. Which would simultaneously function as an upper bound on $K(x)$: since our small program was able to compute $x$, $K(x)$ can't be much higher than $|p| + O(\log n)$.
Thus, in order for arbitrarily complex states/programs to exist, it must be impossible to prove that they are complex.
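Here's that construction rendered as a toy. The lower-bound prover is the hypothetical ingredient, which is the whole point; the stub below is a placeholder that never claims anything nontrivial, so the search never succeeds, which is itself the moral:

```python
from itertools import product

def hypothetical_lower_bound(s: str) -> int:
    """Placeholder for the assumed cheap prover of lower bounds on K(s). No sound,
    computable prover can return large values here, for the reason sketched above;
    this stub claims nothing nontrivial."""
    return 0

def first_provably_complex_string(threshold: int) -> str:
    """Enumerate strings in length order; return the first one whose *proven*
    complexity exceeds `threshold`. If hypothetical_lower_bound actually worked,
    this short program (plus ~log2(threshold) bits to specify the threshold) would
    itself pinpoint that string, giving K(string) <= |this program| + O(log threshold),
    contradicting the proven K(string) > threshold once the threshold is large enough."""
    length = 1
    while True:
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if hypothetical_lower_bound(s) > threshold:
                return s
        length += 1

# With the honest stub above, this search never terminates: a sound prover can only
# ever certify complexities up to (roughly) the size of this very program.
```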
Why? Why does that have to be the case?
Intuitively, it's because "proving" complexity requires pointing at specific features of the state and explaining why exactly they are complex. That is, your formal language must be expressive enough to precisely talk about those features, in their full detail. If, however, you can get away with using some abstractions/generalizations to prove $x$'s complexity, that by definition decreases $x$'s complexity.
Impromptu poll: is structuring long-form comments this way, with collapsibles for topics, convenient, or should I have just used titles? Please react with thumbs up/down to the following statement: "collapsibles good".
All that said,
But this smells really promising to me [...] as a more principled way to tackle bounded rationality of embedded systems.
I'm curious what you have in mind here. I've kind of been treating my thinking on those topics as basically recreational/a guilty pleasure. The possibility that there's something actually useful here interests me.
It does so happen the answer is "basically yes" (me and a friend)
I'm now unsurprised.
I was surprised at the idea they were willing to change the conditions for an entire theater at the request of one person. Seems like if the ability to do that were known, it'd create obvious issues, with people with different preferences constantly walking up and asking to revert each other's requests. I assumed there was no rule against it only because it didn't actually occur to anyone to do that, and therefore it wasn't commonly exploited.
If the theater wasn't actually full of people with potentially different preferences (and that was known to the worker?), then that's not surprising.
...and actually, I'm not even really sure it's best to think of "shards" as having goals, either long-term or short-term
Agreed; I was speaking loosely. (One line of reasoning there goes: shards are contextually activated heuristics; heuristics can be viewed as having been optimized for achieving some goal; inspecting shards (via e. g. self-reflection) can lead to your "reverse-engineering" those implicitly encoded goals; therefore, shards can be considered "proto-goals/values" of a sort, and complex patterns of shard activations can draw the rough shape of goal-pursuit.)
Edited for clarity.
I'm curious, what's your estimate for how much resources it'd take to drive the risk down to 25%, 10%, 1%?