All of johnswentworth's Comments + Replies

Consider two claims:

  • Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
  • Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility

These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
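
The construction behind the first claim is easy to make concrete. A minimal sketch (all names hypothetical): any policy whatsoever can be "rationalized" as a utility maximizer by assigning utility 1 to exactly the action the policy takes in each state.

```python
# Any policy can be "rationalized" as maximizing some utility function:
# assign utility 1 to exactly the action the policy takes in each state,
# and 0 to everything else.
def rationalizing_utility(policy):
    def utility(state, action):
        return 1.0 if action == policy(state) else 0.0
    return utility

policy = lambda state: state % 3   # an arbitrary policy
u = rationalizing_utility(policy)

# The policy's action is always the argmax of this utility function:
for state in range(10):
    best = max(range(3), key=lambda a: u(state, a))
    assert best == policy(state)
```

Which is exactly why this notion of "utility maximizer" carries no predictive content on its own: the utility function does no work the policy wasn't already doing.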

I exp... (read more)

Steven Byrnes (2 points, 3h)
FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details [https://www.lesswrong.com/posts/KDMLJEXTWtkZWheXt/consequentialism-and-corrigibility]) I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning. (When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)
Vladimir_Nesov (2 points, 6h)
A utility function represents preference elicited in a large collection of situations, each a separate choice between events made with incomplete information (an event is not a particular point). This preference needs to be consistent across different situations to be representable by expected utility of a single utility function. Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for the expected utility frame, and that's the kind of system that usually appears in "any system can be modeled as maximizing some utility function". So it's not enough to maximize something once, or in a narrow collection of situations: the situations the system is hypothetically exposed to need to be about as diverse as choices between any pair of events, with some of the events very large (corresponding to unreasonably incomplete information), all drawn across the same probability space.

One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility [https://www.lesswrong.com/posts/XYDsYSbBjqgPAgcoQ/why-the-focus-on-expected-utility-maximisers?commentId=a5tn6B8iKdta6zGFu], even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that has all these situations with separate (rather than all enacting a single policy) and mutually coherent choices under uncertainty. But once an updateless policy is settled (by a policy-level decision), the actions implied by it (rather than action-level decisions in the expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function. So by embracing updatelessness, we lose the setting that would elicit utility if the actions were
JNS (1 point, 9h)
Completely off-the-cuff take: I don't think claim 1 is wrong, but it does clash with claim 2. That means any system that has to be corrigible cannot be a system that maximizes a simple utility function (1 dimension); or, put another way, whatever utility function it maximizes must be along multiple dimensions. Which seems to be pretty much what humans do: we have really complex utility functions, everything seems to be ever-changing, and we have some control over it ourselves (and sometimes that goes wrong and people end up maxing out a singular dimension at the cost of everything else). Note to self: think more about this and, if possible, write up something more coherent and explanatory.

The Wrights invented the airplane using an empirical, trial-and-error approach. They had to learn from experience. They couldn’t have solved the control problem without actually building and testing a plane. There was no theory sufficient to guide them, and what theory did exist was often wrong. (In fact, the Wrights had to throw out the published tables of aerodynamic data, and make their own measurements, for which they designed and built their own wind tunnel.)

This part in particular is where I think there's a whole bunch of useful lessons for alignment... (read more)

Major problem with that particular name: in philosophy, "intention" means something completely different from the standard use. From SEP:

In philosophy, intentionality is the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs. To say of an individual’s mental states that they have intentionality is to say that they are mental representations or that they have contents.

So e.g. Dennett's "intentional stance" does not mean what you probably thought it did, if you've heard of it! (I personally learned of this just recently, thank you Steve Peterson.)

Caspar Oesterheld (4 points, 20h)
Do philosophers commonly use the word "intention" to refer to mental states that have intentionality, though? For example, from the SEP article on intentionality [https://plato.stanford.edu/entries/intentionality/]:

>intention and intending are specific states of mind that, unlike beliefs, judgments, hopes, desires or fears, play a distinctive role in the etiology of actions. By contrast, intentionality is a pervasive feature of many different mental states: beliefs, hopes, judgments, intentions, love and hatred all exhibit intentionality.

(This is specifically where it talks about how intentionality and the colloquial meaning of intention must not be confused, though.)

Ctrl+f-ing through the SEP article gives only one mention of "intention" that seems to refer to intentionality. ("The second horn of the same dilemma is to accept physicalism and renounce the 'baselessness' of the intentional idioms and the 'emptiness' of a science of intention.") The other few mentions of "intention" seem to talk about the colloquial meaning. The article seems to generally avoid the word "intention", using "intentional" and "intentionality" instead.

Incidentally, there's also an SEP article on "intention" [https://plato.stanford.edu/entries/intention/] that does seem to be about what one would think it to be about. (E.g., the first sentence of that article: "Philosophical perplexity about intention begins with its appearance in three guises: intention for the future, as I intend to complete this entry by the end of the month; the intention with which someone acts, as I am typing with the further intention of writing an introductory sentence; and intentional action, as in the fact that I am typing these words intentionally.")

So as long as we don't call it "artificial intentionality research" we might avoid trouble with the philosophers after all. I suppose the word "intentional" becomes ambiguous, however. (It is used >100 times in both SEP articles.)
chaosmage (6 points, 1d)
I fail to see how that's a problem.

Y'know, I didn't realize until reading this that I hadn't seen a short post spelling it out before. The argument was just sort of assumed background in a lot of conversations. Good job noticing and spelling it out.

I believe the authors did a regression, so slightly fancier than that, but basically yes.

Scaling up the data wasn't algorithmic progress. Knowing that they needed to scale up the data was algorithmic progress.

O O (6 points, 9d)
It seems particularly trivial from an algorithmic aspect? You have the compute to try an idea so you try it. The key factor is still the compute. Unless you’re including the software engineering efforts required to get these methods to work at scale, but I doubt that?

That would, and in general restrictions aimed at increasing price/reducing supply could work, though that doesn't describe most GPU restriction proposals I've heard.

Note that this probably doesn't change the story much for GPU restrictions, though. For purposes of software improvements, one needs compute for lots of relatively small runs rather than one relatively big run, and lots of relatively small runs is exactly what GPU restrictions (as typically envisioned) would not block.

GeneSmith (6 points, 10d)
Couldn't GPU restrictions still make them more expensive? Like let's say tomorrow that we impose a tax on all new hardware that can be used to train neural networks such that any improvements in performance will be cancelled out by additional taxes. Wouldn't that also slow down or even stop the growth of smaller training runs?

I expect words are usually pointers to natural abstractions, so that part isn't the main issue - e.g. when we look at how natural language fails all the time in real-world coordination problems, the issue usually isn't that two people have different ideas of what "tree" means. (That kind of failure does sometimes happen, but it's unusual enough to be funny/notable.) The much more common failure mode is that a person is unable to clearly express what they want - e.g. a client failing to communicate what they want to a seller. That sort of thing is one reason why I'm highly uncertain about the extent to which human values (or other variations of "what humans want") are a natural abstraction.

So I saw the Taxonomy Of What Magic Is Doing In Fantasy Books and Eliezer’s commentary on ASC's latest linkpost, and I have cached thoughts on the matter.

My cached thoughts start with a somewhat different question - not "what role does magic play in fantasy fiction?" (e.g. what fantasies does it fulfill), but rather... insofar as magic is a natural category, what does it denote? So I'm less interested in the relatively-expansive notion of "magic" sometimes seen in fiction (which includes e.g. alternate physics), and more interested in the pattern cal... (read more)

Yes, that's an accurate reframing.

There's an asymmetry between local differences from the true model which can't match the true distribution (typically too few edges) and differences which can (typically too many edges). The former get about O(n) bits against them per local difference from the true model, the latter about O(log(n)), as the number of data points n grows.

Conceptually, the story for the log(n) scaling is: with n data points, we can typically estimate each parameter to ~log(n) bit precision. So, an extra parameter costs ~log(n) bits.
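
The bookkeeping can be sketched in BIC style (numbers hypothetical): an extra parameter costs ~½·log₂(n) bits, while a structural mismatch that can't match the true distribution pays bits linear in n.

```python
import math

# BIC-style description length: NLL(data | fitted model) plus
# ~(1/2) * log2(n) bits per free parameter, reflecting the ~log(n)-bit
# precision to which n data points pin down each parameter.
def param_cost_bits(k_params, n):
    return 0.5 * k_params * math.log2(n)

# An extra (unneeded) edge costs O(log n) bits...
assert param_cost_bits(1, 1024) == 5.0
assert param_cost_bits(1, 1024 ** 2) == 10.0

# ...while a model that *can't* match the true distribution pays
# ~KL * n bits, linear in n (a hypothetical KL of 0.01 bits/sample here):
kl_bits_per_sample = 0.01
n = 10_000
assert kl_bits_per_sample * n > param_cost_bits(1, n)
```

This is the asymmetry above in miniature: too-many-edges models lose ~log(n) bits per extra edge, too-few-edges models lose ~n bits.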

Just that does usually work pretty well for (at least a rough estimate of) the undirected graph structure, but then you don't know the directions of any arrows.

tailcalled (2 points, 1mo)
I think this approach only gets the direction of the arrows from two structures, which I'll call colliders and instrumental variables (because that's what they are usually called).

Colliders are the case of A -> B <- C, which in terms of correlations shows up as A and B being correlated, B and C being correlated, and A and C being independent. This is a distinct pattern of correlations from the A -> B -> C or A <- B -> C structures, where all three could be correlated, so it is possible for this method to distinguish the structures (well, sometimes not, but that's tangential to my point [https://www.researchgate.net/publication/276286754_When_causation_does_not_imply_correlation_robust_violations_of_the_Faithfulness_axiom]).

Instrumental variables are the case of A -> B -> C, where A -> B is known but the direction of B - C is unknown. In that case, the fact that C correlates with A suggests that B -> C rather than B <- C.

I think the main advantage larger causal networks give you is that they give you more opportunities to apply these structures? But I see two issues with them.

First, they don't seem to work very well in nondeterministic cases. They both rely on the correlation between A and C, and they both need to distinguish whether that correlation is 0 or ab⋅bc (where ab and bc refer to the effects A - B and B - C) respectively. If the effects in your causal network are of order e, then you are basically trying to distinguish something of order 0 from something of order e², which is likely going to be hard if e is small. (The smaller a difference you are trying to detect, the more affected you are going to be by model misspecification, unobserved confounders, measurement error, etc.) This is not a problem in Zack's case because his effects are near-deterministic, but it would be a problem in other cases. (I in particular have various social science applications in mind.)

Secondly, Zack's example had an advantage that multiple root causes of wet sidewa

I've tried this before experimentally - i.e. code up a gaussian distribution with a graph structure, then check how well different graph structures compress the distribution. Modulo equivalent graph structures (e.g. A -> B -> C vs A <- B <- C vs A <- B -> C), the true structure is pretty consistently favored.

tailcalled (4 points, 1mo)
I don't think this is much better than just linking up variables to each other if they are strongly correlated (at least in ways not explained by existing links)?
Adele Lopez (4 points, 1mo)
Do you know exactly how strongly it favors the true (or equivalent) structure?

(Maybe the disparity scales up with non-tiny examples, though?)

Yup, that's right.

I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?

This getting very meta, but I think my Real Answer is that there's an analogue of You Are Not Measuring What You Think You Are Measuring for plans. Like, the system just does not work any of the ways we're picturing it at all, so plans will just generally not at all do what we imagine they're going to do.

(Of course the plan could still in-pri... (read more)

This answer clears the bar for at least some prize money to be paid out, though the amount will depend on how far other answers go by the deadline.

One thing which would make it stronger would be to provide a human-interpretable function for each equivalence class (so Alice can achieve the channel capacity by choosing among those functions).

The suggestions for variants of the problem are good suggestions, and good solutions to those variants would probably also qualify for prize money.

Yes, there is a story for a canonical factorization of Λ, it's just separate from the story in this post.

Sounds like we need to unpack what "viewing X0 as a latent which generates X" is supposed to mean.

I start with a distribution P[X]. Let's say X is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X] = ΣΛ P[Λ] ∏i P[Xi|Λ], where Λ i... (read more)

Rohin Shah (4 points, 1mo)
Okay, I understand how that addresses my edit. I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

Phase transitions are definitely on the todo list of things to reinvent. Haven't thought about lattice waves or phonons; I generally haven't been assuming any symmetry (including time symmetry) in the Bayes net, which makes such concepts trickier to port over.

shminux (2 points, 1mo)
I guess even without symmetry if one assumes finite interaction time, and the nearest-neighbor-only interaction, an analog of the light cone emerges from these two assumptions. As in, Nth neighbor is unaffected until the time Nt where t is the characteristic interaction time. But I assume you are claiming something much less trivial than that.

Λ is conceptually just the whole bag of abstractions (at a certain scale), unfactored.

Thane Ruthenis (4 points, 1mo)
Sure, but isn't the goal of the whole agenda to show that Λ does have a certain correct factorization, i.e. that abstractions are convergent? I suppose it may be that any choice of low-level boundaries results in the same Λ, but the Λ itself has a canonical factorization, and going from Λ back to XT reveals the corresponding canonical factorization of XT? And then depending on how close the initial choice of boundaries was to the "correct" one, Λ is easier or harder to compute (or there's something else about the right choice that makes it nice to use).

If you have sets of variables that start with no mutual information (conditioning on X0), and they are so far away that nothing other than X0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

Rohin Shah (6 points, 1mo)
Okay, that mostly makes sense. I agree this is true, but why does the Lightcone theorem matter for it? It is also a theorem that a Gibbs resampler initialized at equilibrium will produce XT distributed according to X, and as you say it's clear that the resampler throws away a ton of information about X0 in computing it. Why not use that theorem as the basis for identifying the information to throw away? In other words, why not throw away information from X0 while maintaining XT ∼ X?

EDIT: Actually, conditioned on X0, it is not the case that XT is distributed according to X. (Simple counterexample: Take a graphical model where node A can be 0 or 1 with equal probability, and A causes B through a chain of > 2T steps, such that we always have B = A for a true sample from X. In such a setting, for a true sample from X, B should be equally likely to be 0 or 1, but BT ∣ X0 = B0, i.e. it is deterministic.) Of course, this is a problem for both my proposal and for the Lightcone theorem -- in either case you can't view X0 as a latent that generates X (which seems to be the main motivation, though I'm still not quite sure why that's the motivation).


... I feel compelled to note that I'd pointed out a very similar thing a while ago.

Granted, that's not exactly the same formulation, and the devil's in the details.

Let X0 be the initial state of a Gibbs sampler on an undirected probabilistic graphical model, and XT be the final state. Assume the sampler is initialized in equilibrium, so the distribution of both X0 and XT is the distribution given by the graphical model.

Take any subsets XS1, …, XSk of XT, such that the variables in each subset are at least a distance 2T away from the variables in the other subsets (with distance given by shortest path length in the graph). Then  ... (read more)

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for Λ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which the resulting summary is equivalent to Λ?

Almost. The hope/expectation is that different choices yield approximately the same Λ, though still probably modulo some conditions (like e.g. sufficiently large n).

What's the n here? Is it meant to be T?

System size, i.e. number of variab... (read more)

Thane Ruthenis (6 points, 1mo)
Can you elaborate on this expectation? Intuitively, Λ should consist of a number of higher-level variables as well, and each of them should correspond to a specific set of lower-level variables: abstractions and the elements they abstract over. So for a given Λ, there should be a specific "correct" way to draw the boundaries in the low-level system. But if ~any way of drawing the boundaries yields the same Λ, then what does this mean? Or perhaps the "boundaries" in the mesoscale-approximation approach represent something other than the factorization of X into individual abstractions?
Thane Ruthenis (7 points, 1mo)
By the way, do we need the proof of the theorem to be quite this involved? It seems we can just note that for any two (sets of) variables X1, X2 separated by distance D, the earliest sampling-step at which their values can intermingle (= their lightcones intersect) is D/2 (since even in the "fastest" case, they can't do better than moving towards each other at 1 variable per 1 sampling-step).
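
That speed-limit argument is easy to check on a toy update rule (a hypothetical deterministic nearest-neighbor rule, not an actual Gibbs sampler): a perturbation at one site spreads at most one site per step, so two sites at distance D can't interact before ~D/2 steps.

```python
# Toy nearest-neighbor update: each cell becomes the max of its neighborhood.
# Any local perturbation then spreads at most 1 cell per step — a lightcone.
def step(s):
    return [max(s[max(i - 1, 0):i + 2]) for i in range(len(s))]

n, mid = 21, 10
a = [0] * n
b = [0] * n
b[mid] = 1                       # perturb a single site

for t in range(1, 8):
    a, b = step(a), step(b)
    diff = [i for i in range(n) if a[i] != b[i]]
    assert all(abs(i - mid) <= t for i in diff)   # speed <= 1 cell/step
```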

First crucial point which this post is missing: the first (intuitively wrong) net reconstructed represents the probabilities using 9 parameters (i.e. the nine rows of the various truth tables), whereas the second (intuitively right) represents the probabilities using 8. That means the second model uses fewer bits; the distribution is more compressed by the model. So the "true" network is favored even before we get into interventions.
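
The parameter-counting works as follows. A sketch with a hypothetical 3-variable example (not the post's actual networks): each node of a Bayes net over binary variables needs one probability per joint configuration of its parents.

```python
# Parameter count of a Bayes net over binary variables: each node needs
# one probability per joint configuration of its parents.
def n_params(parents_of):
    return sum(2 ** len(ps) for ps in parents_of.values())

# Hypothetical example: a chain vs. the same chain plus one extra edge.
chain      = {"A": [], "B": ["A"], "C": ["B"]}        # 1 + 2 + 2 = 5
extra_edge = {"A": [], "B": ["A"], "C": ["A", "B"]}   # 1 + 2 + 4 = 7
assert n_params(chain) == 5
assert n_params(extra_edge) == 7
```

A one-parameter difference like the post's 9-vs-8 means the smaller model saves roughly ½·log₂(n) bits on n data points, per the usual description-length accounting.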

Implication of this for causal epistemics: we have two models which make the same predictions on-distribution, and only make ... (read more)

Caspar Oesterheld (1 point, 1mo)
>First crucial point which this post is missing: the first (intuitively wrong) net reconstructed represents the probabilities using 9 parameters (i.e. the nine rows of the various truth tables), whereas the second (intuitively right) represents the probabilities using 8. That means the second model uses fewer bits; the distribution is more compressed by the model. So the "true" network is favored even before we get into interventions.
>
>Implication of this for causal epistemics: we have two models which make the same predictions on-distribution, and only make different predictions under interventions. Yet, even without actually observing any interventions, we do have reason to epistemically favor one model over the other.

For people interested in learning more about this idea: This is described in Section 2.3 of Pearl's book Causality [https://yzhu.io/courses/core/reading/04.causality.pdf]. The beginning of Ch. 2 also contains some information about the history of this idea. There's also a more accessible post by Yudkowsky [https://www.lesswrong.com/posts/hzuSDMx7pd2uxFc5w/causal-diagrams-and-causal-models] that has popularized these ideas on LW, though it contains some inaccuracies, such as explicitly equating causal graphs and Bayes nets.
tailcalled (4 points, 1mo)
Thinking about this algorithmically: in e.g. factor analysis, after performing PCA to reduce a high-dimensional dataset to a low-dimensional one, it's common to use varimax to "rotate" the principal components so that each resulting axis has a sparse relationship with the original indicator variables (each "principal" component correlating only with one indicator). However, this instead seems to suggest that one should rotate them so that the resulting axes have a sparse relationship with the original cases (each data point deviating from the mean on as few "principal" components as possible).

I believe that this sort of rotation (without the PCA) has actually been used in certain causal inference algorithms, but as far as I can tell it basically assumes that causality flows from variables with higher kurtosis to variables with lower kurtosis, which admittedly seems plausible for a lot of cases, but also seems like it consistently gives the wrong results if you've got certain nonlinear/thresholding effects (which seem plausible in some of the areas I've been looking to apply it). Not sure whether you'd say I'm thinking about this right?

I'm trying to think of why modelling this using a simple intervention is superior to modelling it as e.g. a conditional. One answer I could come up with is if there are some correlations across the different instances of the system, e.g. seasonal variation in rain or similar, or turning the sprinkler on partway through a day. Though these sorts of correlations are probably best modelled by expanding the Bayesian network to include time or similar.
localdeity (7 points, 1mo)
Is this always going to be the case? I feel like the answer is "not always", but I have no empirical data or theoretical argument here.
David Johnston (6 points, 1mo)
I have a paper (planning to get it on arxiv any day now…) which contains a result: independence of causal mechanisms (which can be related to Occam’s razor & your first point here) + precedent (“things I can do have been done before”) + variety (related to your second point - we’ve observed the phenomena in a meaningfully varied range of circumstances) + conditional independence (which OP used to construct the Bayes net) implies a conditional distribution invariant under action. That is, speaking very loosely, if you add your considerations to OPs recipe for Bayes nets and the assumption of precedent, you can derive something kinda like interventions.

Good question. I recommend looking at this post. The very short version is:

  • P isn't itself a distribution. It's an operator which takes in a model (i.e. M), and spits out distributions of events/variables defined in that model (i.e. P[X|M]).
  • The model M contains some random variables (i.e. X and maybe others), and somehow specifies how to sample them. I usually picture M as either a Judea Pearl-style causal DAG, or a program which calls rand() sometimes.
  • X is a variable in the model.

I don't know of a good existing write-up on this, and I think it would be valuable for someone to write.

This seems to be arguing against a starry-eyed idealist case for an "AI disarmament treaty", but not really against a cynical/realistic case. (At first I was going to say "arguing against a strawman", but no, there are in fact lots of starry-eyed idealists in alignment.)

Here's my cynical/realistic case for an "AI disarmament treaty" (or something vaguely in that cluster) with China. As the post notes, the regulations mostly provide evidence that Beijing sees near-term AI as a potential threat to stability that needs to be addressed with regulation. For pur... (read more)

Lao Mein (3 points, 2mo)
Parity in AI isn't what China is after - China doesn't want to preserve the status quo. We want to win. We want AI hegemony. We want to be years ahead of the US in terms of our AIs. And frankly, we're not that far behind - the recent Baidu LLMs perform somewhere between GPT2 and GPT3. To tie is to lose. Stopping the race now is the same as losing.  I also don't see how LLMs can destabilize China in the near-term. Spam/propaganda isn't a big issue since you need to submit your real-life ID in order to post on Chinese sites.

There are NNs that train for a lifetime then die, and there are NNs that train for a lifetime but then network together to share all their knowledge before dying.

But crucially, humans do not share all their knowledge. Every time a great scientist or engineer or manager or artist dies, a ton of intuition and skills and illegible knowledge dies with them. What is passed on is only what can be easily compressed into the extremely lossy channels of language.

As the saying goes, "humans are as stupid as they can be while still undergoing intelligence-driven take... (read more)

jacob_cannell (5 points, 2mo)
Edit: Of course humans do not share all their knowledge, and the cultural transition is obviously graded, in the sense that the evolutionary stages of early language, writing, printing press, computers, internet, etc. gradually improve the externalized network connectivity and storage of our cybernetic civilization. But by the time of AGI that transition is already very well along, such that all we are really losing - as you point out and I agree - is a ton of intuitions/skills/knowledge that dies with the decay of human brains; we externalize much of the most important of our knowledge. Nonetheless ending that tragedy is our great common cause.

I agree that substrate independence is one of the great advantages of digital minds, other than speed. But there are some fundamental tradeoffs. You can use GPUs (von Neumann architecture), which separate memory and logic. They are much much slower in the sense that they take many many cycles to simulate one cycle of a large ANN, and they waste much energy having to shuffle the weights around the chip from memory to logic. Or you can use neuromorphic computers, which combine memory and logic. They are potentially enormously faster, as they can simulate one cycle of a large ANN per clock cycle, but are constrained to more brain-like designs and thus optimized for low circuit depth but larger circuits (cheap circuitry). For the greatest cheap circuit density, energy efficiency, and speed you need to use analog synapses, but in doing so you basically give up the ability to easily transfer the knowledge out of the system - it becomes more 'mortal', as Hinton recently argues.

Yeah, the main changes I'd expect in category 1 are just pushing things further in the directions they're already moving, and then adjusting whatever else needs to be adjusted to match the new hyperparameter values.

One example is brain size: we know brains have generally grown larger in recent evolutionary history, but they're locally-limited by things like e.g. birth canal size. Circumvent the birth canal, and we can keep pushing in the "bigger brain" direction.

Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order... (read more)

jacob_cannell (4 points, 2mo)
Chinchilla scaling already suggests the human brain is too big for our lifetime data, and multiple distant lineages with very few natural size limits (whales, elephants) ended up plateauing around the same OOM of brain neuron and synapse counts. Human intelligence in terms of brain architectural priors also plateaus; the Ashkenazi just selected a bit more strongly towards that plateau. Intelligence also has neoteny tradeoffs, resulting in numerous ecological niches in tribes - faster to breed often wins.

Ah, interesting. If I were going down that path, I'd probably aim to use a Landauer-style argument. Something like, "here's a bound on mutual information between the policy and the whole world, including the agent itself". And then a lock/password could give us a lot more optimization power over one particular part of the world, but not over the world as a whole.

... I'm not sure how to make something like that nontrivial, though. Problem is, the policy itself would then presumably be embedded in the world, so I(policy; world) is just H(policy).

dr_s (3 points, 2mo)
Here's my immediate thought on it: you define a single world bit string W, and A, O and Y are just designated subsections of it. You are able to know only the contents of O, and can set the contents of A (this feels like it's reducing the entropy of the whole world, btw, so you could also postulate that you can only do so by drawing free energy from some other region, your fuel F: for each bit of A you set deterministically, you need to randomize two of F, so that the overall entropy increases). After this, some kind of map W→f(W) is applied repeatedly, evolving the system until such time comes to check that the region Y is indeed as close as possible to your goal configuration G.

I think at this point the properties of the result will depend on the properties of the map - is it a "lock" map like your suggested one (compare a region of A with O, and if they're identical, clone the rest of A into Y, possibly using up F to keep the entropy increase positive)? Is it reversible, is it chaotic? Yeah, not sure, I need to think about it. Reversibility (even acting as if these were qubits and not simple bits) might be the key here.

In general I think there can't be any hard rule against lock-like maps, because the real world allows building locks. But maybe there's some rule about how if you define the map itself randomly enough, it probably won't be a lock-map (for example, you could define a map as a series of operations on two bits writing to a third one op(i,j)→k; decide a region of your world for it, encode bit indices and operators as bit strings, and you can make the map's program itself a part of the world, and then define what makes a map a lock-like map and how probable that occurrence is).

This is all assuming that the power consumption for a wire is at-or-near the Landauer-based limit Jacob argued in his post.

Thanks!

Also, I recognize that I'm kinda grouchy about the whole thing and that's probably coming through in my writing, and I appreciate a lot that you're responding politely and helpfully on the other side of that. So thank you for that too.

Here are two intuitive arguments:

  • If we can't observe O, we could always just guess a particular value of O and then do whatever's optimal for that value. Then with probability P[O], we'll be right, and our performance is lower bounded by P[O]*(whatever optimization pressure we're able to apply if we guess correctly).
  • The log-number of different policies bounds the log-number of different outcome-distributions we can achieve. And observing one additional bit doubles the log-number of different policies.
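A minimal numeric sketch of the second bullet (the action and observation counts are arbitrary):

```python
import math

# A policy maps observation strings to actions. With k actions and b
# observed bits there are k ** (2 ** b) distinct policies, so the
# log-number of policies doubles with each additional observed bit.
def log2_num_policies(num_actions, observed_bits):
    return (2 ** observed_bits) * math.log2(num_actions)

assert log2_num_policies(4, 0) == 2.0  # no observation: 2 bits of choice
assert log2_num_policies(4, 1) == 4.0  # one observed bit doubles the log-count
assert log2_num_policies(4, 2) == 8.0
```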

I mean, sure, but I doubt that e.g. Eliezer thinks evolution is inefficient in that sense.

Basically, there are only a handful of specific ways we should expect to be able to beat evolution in terms of general capabilities, a priori:

  • Some things just haven't had very much time to evolve, so they're probably not near optimal. Broca's area would be an obvious candidate, and more generally whatever things separate human brains from other apes.
  • There are ways to nonlocally redesign the whole system to jump from one local optimum to somewhere else.
  • We're optimizing a
... (read more)

Interesting - I think I disagree most with 1. The neuroscience seems pretty clear that the human brain is just a scaled-up standard primate brain; the secret sauce is just language (I discuss this now and again in some posts and in my recent part 2). In other words, nothing new about the human brain has had much time to evolve - all evolution did was tweak a few hyperparams, mostly around size and neoteny (training time): very very much like GPT-N scaling (which my model predicted).

Basically human technology beats evolution because we are not constrained ... (read more)

In an absolute sense, yes, but I expect it can be bounded as a function of bits of optimization without observation. For instance, if we could only at most double the number of bits of opt by observing one bit, then that would bound bit-gain as a function of bits of optimization without observation, even though it's unbounded in an absolute sense.

Unless you're seeing some stronger argument which I have not yet seen?

1 · M. Y. Zuo · 2mo
The scaling would also be unbounded, at least that would be my default assumption without solid proof otherwise.  In other words I don't see any reason to assume there must be any hard cap, whether at 2x or 10x or 100x, etc...

The new question is: what is the upper bound on bits of optimization gained from a bit of observation? What's the best-case asymptotic scaling? The counterexample suggests it's roughly exponential, i.e. one bit of observation can double the number of bits of optimization. On the other hand, it's not just multiplicative, because our xor example at the top of this post showed a jump from 0 bits of optimization to 1 bit from observing 1 bit.

1 · Will_BC · 1mo
I think it depends on the size of the world model. Imagine an agent with a branch due to uncertainty between two world models. It can construct these models in parallel but doesn't know which one is true. Every observation it makes has two interpretations. A single observation which conclusively determines which branch world model was correct I think could produce an arbitrarily large but resource bounded update.
1 · FireStormOOO · 2mo
It's possible to construct a counterexample where there's a step from guessing at random to perfect knowledge after an arbitrary number of observed bits: n-1 bits of evidence are worthless alone, and the nth bit lets you perfectly predict the next bit and all future bits.

Consider, for example, shifting bits in one at a time into the input of a known hash function that's been initialized with an unknown value (and known width), where I ask you to guess a specified bit from the output. In the idealized case, you know nothing about the output of the function until you learn the final bit in the input (all unknown bits have shifted out), because they're perfectly mixed, and after that you'll guess every future bit correctly. Seems like the pathological cases can be arbitrarily messy.
1 · M. Y. Zuo · 2mo
Isn't it unbounded? 

The four claims you listed as "central" at the top of this thread don't even mention the word "brain", let alone anything about it being pareto-efficient.

It would make this whole discussion a lot less frustrating for me (and probably many others following it) if you would spell out what claims you actually intend to make about brains, nanotech, and FOOM gains, with the qualifiers included. And then I could either say "ok, let's see how well the arguments back up those claims" or "even if true, those claims don't actually say much about FOOM because...", rather than this constant probably-well-intended-but-still-very-annoying jumping between stronger and weaker claims.

Ok, fair - those are more like background ideas/claims, so I reworded that and added 2.

Alright, I think we have an answer! The conjecture is false.

Counterexample: suppose I have a very-high-capacity information channel (N bit capacity), but it's guarded by a uniform random n-bit password. O is the password, A is an N-bit message and a guess at the n-bit password. Y is the N-bit message part of A if the password guess matches O; otherwise, Y is 0.

Let's say the password is 50 bits and the message is 1M bits. If A is independent of the password, then there's a 2^-50 chance of guessing the password, so the bitrate will be about ... (read more)
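A scaled-down simulation of the counterexample (a 4-bit password and a small trial count instead of 50 bits and 1M-bit messages, purely for speed):

```python
import random

random.seed(0)
PW_BITS = 4

def channel(message, guess, password):
    # Y = the message part of A if the password guess matches O, else 0.
    return message if guess == password else 0

# A sender who observes O (the password) gets every message through. A
# sender who can't observe O succeeds only with probability 2**-PW_BITS.
trials, hits = 20000, 0
for _ in range(trials):
    password = random.getrandbits(PW_BITS)
    guess = random.getrandbits(PW_BITS)  # independent of the password
    hits += channel(1, guess, password)
print(hits / trials)  # close to 2**-4 = 0.0625
```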

8 · dr_s · 2mo
This is interesting, but it also somehow feels a bit like a "cheat" compared to the more "real" version of this problem (namely: if I know something about the world and can think intelligently about it, how much leverage can I get out of it?). The kind of system in which you can pack so much information in an action, and at the cost of a small bit of information you get so much leverage, feels like it ought to be artificial. Trivially, this is actually what makes a lock (real or virtual) work: if you have one simple key/password, you get to do whatever with the contents. But the world as a whole doesn't seem to work as a locked system (if it did, we would have magic: just a tiny, specific formula or gesture, and we get massive results down the line).

I wonder if the key here isn't in the entropy. Your knowing O here allows you to significantly reduce the entropy of the world as a whole. This feels akin to being a Maxwell demon. In the physical world, though, there are bounds on that sort of observation and action exactly because being able to do them would allow you to violate the 2nd principle of thermodynamics. So I wonder if the conjecture may be true under some additional constraints which also include these common properties of macroscopic closed physical systems (while it remains false in artificial subsystems that we can build for the purpose, in which we only care about certain bits and not all the ones defining the underlying physical microstates).

Trying to patch the thing which I think this example was aiming for:

Let A be an n-bit number, O be 0 or 1 (50/50 distribution). Then let Y = A if A mod 2 = O, else Y = 0. If the sender knows O, then they can convey n-1 bits with every message (i.e. n bits minus the lowest-order bit). If the sender does not know O, then half the messages are guaranteed to be 0 (and which messages are 0 communicates at most 1 bit per, although I'm pretty sure it's in fact zero bits per in this case, so no loophole there). So at most ~n/2 bits per message can be conveyed if ... (read more)
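A quick sketch of this patched channel (with n = 8 for concreteness):

```python
# Y = A when A's lowest-order bit matches O, else Y = 0.
def channel(a, o):
    return a if (a & 1) == o else 0

n, o = 8, 0

# Knowing O, the sender pins the low bit to O and uses the remaining n-1
# bits freely: all 2**(n-1) such messages get through intact.
sent = [(payload << 1) | o for payload in range(2 ** (n - 1))]
assert all(channel(a, o) == a for a in sent)

# Without knowing O, half of all possible messages get zeroed out.
blocked = sum(1 for a in range(2 ** n) if channel(a, o) != a)
assert blocked == 2 ** (n - 1)
```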

6 · Yair Halberstadt · 2mo
That sounds about right. I think tripling is definitely a hard max, since you can send 3 messages - action if true, action if false, plus which is which - at least assuming you can reliably send a bit at all without the observation. More tightly, it's doubling plus the number of bits required to send a single bit of information.

Damn, that one sounded really promising at first, but I don't think it works. Problem is, if A is fixed-length, then knowing the number of 1's also tells us the number of 0's. And since we get to pick P[A] in the optimization problem, we can make A fixed-length.

EDIT: oh, Alex beat me to the punch.

My gloss of the section is 'you could potentially make the brain smaller, but it's the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table'

I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn't apply to engineered compute hardware. More generally, the brain is probably efficient relative to lots of constraints which don't apply to engineered compute hardw... (read more)

3 · jacob_cannell · 1mo
The main constraint at minimal device sizes is the thermodynamic limit for irreversible computers, so the wire energy constraint is dominant there. However, the power dissipation/cooling ability for a 3D computer only scales with the surface area d^2, whereas compute device density scales with d^3 and interconnect scales somewhere in between. The point of the temperature/cooling section was just to show that shrinking the brain by a factor of X (if possible given space requirements of wire radius etc) would increase surface power density by a factor of X^2, but would only decrease wire length & energy by X, and would not decrease synapse energy at all.

2D chips scale differently of course: the surface area and heat dissipation tend to both scale with d^2. Conventional chips are already approaching miniaturization limits and will dissipate too much power at full activity, but that's a separate investigation. 3D computers like the brain can't run that hot given any fixed tech ability to remove heat per unit surface area. 2D computers are also obviously worse in many respects, as long-range interconnect bandwidth (to memory) only scales with d rather than the d^2 of compute, which is basically terrible compared to a 3D system where compute density and long-range interconnect scale with d^3 and d^2 respectively.
3 · ADifferentAnonymous · 2mo
Had it turned out that the brain was big because blind-idiot-god left gains on the table, I'd have considered it evidence of more gains lying on other tables and updated towards faster takeoff.

FWIW, I basically buy all of these, but they are not-at-all sufficient to back up your claims about how superintelligence won't foom (or whatever your actual intended claims are about takeoff). Insofar as all this is supposed to inform AI threat models, it's the weakest subclaims necessary to support the foom-claims which are of interest, not the strongest subclaims.

I basically buy all of these, but they are not-at-all sufficient to back up your claims about how superintelligence won't foom

Foom isn't something that EY can prove beyond doubt or I can disprove beyond doubt, so this is a matter of subjective priors and posteriors.

If you were convinced of foom inevitability before, these claims are unlikely to convince you of the opposite, but they do undermine EY's argument:

  • they support the conclusion that the brain is reasonably pareto-efficient (greatly undermining EY's argument that evolution and the brain are grossl
... (read more)

I think you may be misunderstanding why I used the blackbody temp - I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space).

There's a pattern here which seems-to-me to be coming up repeatedly (though this is the most legible example I've seen so far). There's a key qualifier which you did not actually include in your post, which would make the claims true. But once that qualifier is added, it's much more obvious that the arguments are utterly... (read more)

The 'big-sounding' claim you quoted makes more sense only with the preceding context you omitted:

Conclusion: The brain is a million times slower than digital computers, but its slow speed is probably efficient for its given energy budget, as it allows for a full utilization of an enormous memory capacity and memory bandwidth. As a consequence of being very slow, brains are enormously circuit cycle efficient. Thus even some hypothetical superintelligence, running on non-exotic hardware, will not be able to think much faster than an artificial brain runnin

... (read more)

After chewing it on it a bit, I find it very plausible that this is indeed a counterexample. However, it is not obvious to me how to prove that there does not exist some clever encoding scheme which would achieve bit-throughput competitive with the O-dependent encoding without observing O. (Note that we don't actually need to ensure the same Y pops out either way, we just need the receiver to be able to distinguish between enough possible inputs A by looking at Y.)

Ok simpler example:

You know the channel either removes all 0s or all 1s, but you don't know which.

The most efficient way to send a message is to send n 1s, followed by n 0s, where n is the number represented by the binary message you want to send.

If you know whether 1s or 0s are stripped out, then you only need to send n bits of information, for a total saving of n bits.

EDIT: this doesn't work, see comment by AlexMennen.
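For concreteness, the scheme can be simulated directly; it does recover n under either stripping behavior (per the edit, the problem is with the bit-accounting argument, not the mechanics):

```python
# To send the number n: transmit n 1s followed by n 0s. The channel deletes
# either all 0s or all 1s (the sender doesn't know which); the receiver
# counts whatever survives to recover n.
def encode(n):
    return [1] * n + [0] * n

def channel(bits, stripped):  # stripped is the symbol the channel deletes
    return [b for b in bits if b != stripped]

def decode(received):
    return len(received)

for n in range(16):
    for stripped in (0, 1):
        assert decode(channel(encode(n), stripped)) == n
```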

(Note that this, in turn, also completely undermines the claims about optimality of speed in the next section. Those claims ultimately ground out in high temperatures making high clock speeds prohibitive, e.g. this line:

Scaling a brain to GHz speeds would increase energy and thermal output into the 10MW range, and surface power density to  / , with temperatures well above the surface of the sun

)

4 · jacob_cannell · 2mo
For extra clarification, that should perhaps read " with (uncooled) temperatures well above ..." (ie isolated in vacuum).

(Copied with some minor edits from here.)

Jacob's argument in the Density and Temperature section of his Brain Efficiency post basically just fails.

Jacob is using a temperature formula for blackbody radiators, which is basically irrelevant to the temperature of realistic compute substrates - brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip). The obvious law to use instead would just be the standard thermal conduction law: heat flow per uni... (read more)
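To see why the choice of law matters, here's a back-of-envelope comparison (illustrative numbers, not taken from either post): the blackbody formula demands an enormous temperature to shed a chip-like heat flux by radiation alone, while Fourier conduction through a good conductor needs only a tiny gradient.

```python
SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W/(m^2 K^4)
q = 1e5           # heat flux: 10 W/cm^2 = 1e5 W/m^2, roughly CPU-scale

# Blackbody: temperature needed to radiate that flux, T = (q/sigma)^(1/4).
T_blackbody = (q / SIGMA) ** 0.25  # ~1150 K

# Conduction, Fourier's law q = k * dT / d: temperature drop needed to
# push the same flux through 1 mm of copper (k ~ 400 W/(m K)).
k, d = 400.0, 1e-3
dT_conduction = q * d / k  # 0.25 K

print(round(T_blackbody), dT_conduction)
```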

4 · jacob_cannell · 1mo
If we fix the neuron/synapse/etc count (and just spread them out evenly across the volume), then length and thus power consumption of interconnect linearly scale with radius R, but the power consumption of compute units (synapses) doesn't scale at all. Surface power density scales with R^2.

This seems rather obviously incorrect to me:

  1. There is simply a maximum amount of heat/entropy any particle of coolant fluid can extract, based on the temperature difference between the coolant particle and the compute medium.
  2. The maximum flow of coolant particles scales with the surface area.
  3. A fixed compute temperature limit, coolant temp, and coolant pump rate thus results in a limit on the device radius.

But obviously I do agree the brain is nowhere near the technological limits of active cooling in terms of entropy removed per unit surface area per unit time - but that's also mostly irrelevant, because you expend energy to move the heat and the brain has a small energy budget of 20W. Its coolant budget is proportional to its compute budget. Moreover, as you scale the volume down, the coolant travels a shorter distance and has less time to reach equilibrium temp with the compute volume and thus extract the max entropy (but not sure how relevant that is at brain size scales).
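The disputed scaling can at least be written down explicitly (a sketch with made-up brain-ish numbers): holding total power fixed, shrinking the radius by X multiplies surface power density by X^2, while per-connection wire length shrinks only by X.

```python
import math

P, R = 20.0, 0.07   # ~20 W and ~7 cm radius: rough, brain-like figures
X = 2.0             # shrink factor

def surface_power_density(power, radius):
    return power / (4 * math.pi * radius ** 2)  # W/m^2 over a sphere

d_before = surface_power_density(P, R)
d_after = surface_power_density(P, R / X)
assert abs(d_after / d_before - X ** 2) < 1e-9  # density grows as X^2

wire_before, wire_after = R, R / X              # wire length ~ radius
assert wire_after / wire_before == 1 / X        # wire energy drops only as X
```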

I'm going to make this slightly more legible, but not contribute new information.

Note that downthread, Jacob says:

the temp/size scaling part is not one of the more core claims so any correction there probably doesn't change the conclusion much.

So if your interest is in Jacob's arguments as they pertain to AI safety, this chunk of Jacob's writings is probably not key for your understanding and you may want to focus your attention on other aspects.

Both Jacob and John agree on the obvious fact that active cooling is necessary for both the brain and for GPUs a... (read more)

5 · ADifferentAnonymous · 2mo
I agree the blackbody formula doesn't seem that relevant, but it's also not clear what relevance Jacob is claiming it has. He does discuss that the brain is actively cooled. So let's look at the conclusion of the section:

If the temperature-gradient-scaling works and scaling down is free, this is definitely wrong. But you explicitly flag your low confidence in that scaling, and I'm pretty sure it wouldn't work.* In which case, if the brain were smaller, you'd need either a hotter brain or a colder environment. I think that makes the conclusion true (with the caveat that 'considerations' are not 'fundamental limits').

(My gloss of the section is 'you could potentially make the brain smaller, but it's the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table'.)

* I can provide some hand-wavy arguments about this if anyone wants.
6 · jacob_cannell · 2mo
I think you may be misunderstanding why I used the blackbody temp - I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space). So when I (or the refs I link) mention "temperatures greater than the surface of the sun" for the surface of some CMOS processor, it is not because we actually believe your GPU achieves that temperature (unless you have some critical cooling failure or short circuit, in which case it briefly achieves a very high temperature before melting somewhere).

I think this makes all the wrong predictions and so is likely wrong, but I will consider it more. Of course - not really relevant for the brain, but that is an option for computers. Obviously you aren't gaining thermodynamic efficiency by doing so - you pay extra energy to transport the heat.

All that being said, I'm going to look into this more, and if I feel a correction to the article is justified I will link to your comment here with a note. But the temp/size scaling part is not one of the more core claims, so any correction there probably doesn't change the conclusion much.