tailcalled's Shortform

If a tree falls in the forest, and two people are around to hear it, does it make a sound?

I feel like typically you'd say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.

But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it's more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.

3Dagon
I think this just repeats the original ambiguity of the question, by using the word "sound" in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable.  It's still a question of definition, not of understanding what actually happens.
3tailcalled
But the way to resolve definitional questions is to come up with definitions that make it easier to find general rules about what happens. This illustrates one way to do that: pick the edge case so that it scales nicely with the rules that occur in normal cases. (Another example would be defining 1 as not a prime number.)
2Dagon
My recommended way to resolve (aka disambiguate) definitional questions is "use more words".  Common understandings can be short, but unusual contexts require more signals to communicate.
1Bert
I think we're playing too much with the meaning of "sound" here. The tree causes some vibrations in the air, which lead to two auditory experiences since there are two people.

Finally gonna start properly experimenting on stuff. Just writing up what I'm doing to force myself to do something, not claiming this is necessarily particularly important.

Llama (and many other models, but I'm doing experiments on Llama) has a piece of code that looks like this:

    h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
    out = h + self.feed_forward(self.ffn_norm(h))

Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are basically where all the computation happens. So basically the transformer proceeds as a series of "writes" to the residual stream using these two vectors.

I took all the residual vectors for some queries to Llama-8b and stacked them into a big matrix M with 4096 columns (the internal hidden dimensionality of the model). Then using SVD, I can express M = ∑_i σ_i u_i v_i^T, where the u_i's and v_i's are independent unit vectors. This basically decomposes the "writes" into some independent locations in the residual stream (u's), some lat... (read more)
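Roughly, the setup looks like this (a minimal sketch, not the exact code; the model.layers attribute, the tokens/start_pos call, and the hook-based collection are assumptions based on the reference implementation quoted above):

    import torch

    writes = []  # vectors that get added to the residual stream during one forward pass

    def record_write(module, inputs, output):
        # the sublayer output is exactly the vector added to the residual stream
        writes.append(output.detach().reshape(-1, output.shape[-1]))

    hooks = []
    for layer in model.layers:  # assumed attribute holding the transformer blocks
        hooks.append(layer.attention.register_forward_hook(record_write))
        hooks.append(layer.feed_forward.register_forward_hook(record_write))

    with torch.no_grad():
        model(tokens, start_pos=0)  # assumed call signature

    for h in hooks:
        h.remove()

    M = torch.cat(writes, dim=0)  # (number of writes, 4096)
    U, S, Vt = torch.linalg.svd(M, full_matrices=False)
    # rows of Vt are the right singular vectors (the v's), i.e. directions in the residual
    # stream that the writes use; S gives each direction's overall weight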

2tailcalled
Ok, so I've got the clipping working. First, some uninterpretable diagrams:

In the bottom six diagrams, I try taking a varying number (x-axis) of right singular vectors (v's) and projecting the "writes" to the residual stream down to the space spanned by those vectors. The obvious criterion to care about is whether the projected network reproduces the outputs of the original network, which here I operationalize as the log probability the projected network gives to the continuation of the prompt (shown in the "generation probability" diagrams). This appears to be fairly chaotic (and low) in the 1-300ish range, then stabilizes while still being pretty low in the 300ish-1500ish range, then finally converges to normal in the 1500ish-2000ish range, and is ~perfect afterwards.

The remaining diagrams show something about how/why we get this pattern. "orig_delta" concerns the magnitude of the attempted writes for a given projection (which is not constant, because projecting in earlier layers will change the writes by later layers), and "kept_delta" concerns the remaining magnitude after the discarded dimensions have been projected away. In the low end, "kept_delta" is small (and even "orig_delta" is a bit smaller than it ends up being at the high end), indicating that the network fails to reproduce the probabilities because the projection is so aggressive that it simply suppresses the network too much. Then in the middle range, "orig_delta" and "kept_delta" explode, indicating that the network has some internal runaway dynamics which would normally be suppressed, but the suppression system is broken by the projection. Finally, in the high range, we get a sudden improvement in loss and a sudden drop in residual-stream "write" size, indicating that it has managed to suppress this runaway stuff and now it works fine.
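For reference, the projection itself is simple; a sketch (Vt from the SVD above; how the patched writes get fed back into the forward pass is left schematic here):

    def project_write(delta, Vt, k):
        # project residual-stream write(s) onto the span of the top-k right singular vectors
        V_k = Vt[:k]                 # (k, 4096)
        return delta @ V_k.T @ V_k   # components outside the span are discarded

    # the two magnitudes plotted above, for a single write vector delta:
    # orig_delta = delta.norm()
    # kept_delta = project_write(delta, Vt, k).norm()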
2tailcalled
An implicit assumption I'm making when I clip off from the end with the smallest singular values is that the importance of a dimension is proportional to its singular value. This seemed intuitively sensible to me ("bigger = more important"), but I thought I should test it, so I tried clipping off only one dimension at a time and plotting how that affected the probabilities:

Clearly there is a correlation, but also clearly there are some deviations from that correlation. Not sure whether I should try to exploit these deviations in order to do further dimension reduction. It's tempting, but it also feels like it starts entering sketchy territories, e.g. overfitting and arbitrary basis picking. Probably gonna do it just to check what happens, but am on the lookout for something more principled.
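A sketch of that single-dimension test (baseline_logprob and the logprob_of_continuation helper, which would rerun the model with the given patch applied to every write, are hypothetical):

    def remove_direction(delta, Vt, j):
        # ablate only the j-th singular direction from a write vector
        v = Vt[j]                                    # (4096,) unit vector
        return delta - (delta @ v).unsqueeze(-1) * v

    importance = []
    for j in range(Vt.shape[0]):
        patch = lambda delta, j=j: remove_direction(delta, Vt, j)
        importance.append(baseline_logprob - logprob_of_continuation(patch))  # hypothetical helper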
2tailcalled
Back to clipping away an entire range, rather than a single dimension. Here's ordering it by the importance computed by clipping away a single dimension:

Less chaotic maybe, but also much slower at reaching a reasonable performance, so I tried a compromise ordering that takes both size and performance into account:

Doesn't seem like it works super great tbh.

Edit: for completeness' sake, here's the initial graph with log-surprise-based plotting.
2tailcalled
To quickly find the subspace that the model is using, I can use a binary search to find the number of singular vectors needed before the probability when clipping exceeds the probability when not clipping. A relevant followup is what happens to other samples in response to the prompt when clipping. When I extrapolate "I believe the meaning of life is" using the 1886-dimensional subspace from  , I get: Which seems sort of vaguely related, but idk. Another test is just generating without any prompt, in which case these vectors give me: Using a different prompt: I can get a 3329-dimensional subspace which generates: or Another example: can yield 2696 dimensions with or And finally, can yield the 2518-dimensional subspace: or
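A sketch of the binary search (this assumes the clipped log-probability is roughly monotone in the number of kept dimensions, which the earlier plots suggest only holds at the high end; logprob_with_clipping(k) is a hypothetical helper that reruns the model with all writes projected onto the top-k directions):

    def min_dims_needed(logprob_with_clipping, full_logprob, d_model=4096):
        # smallest k whose clipped log-prob matches or exceeds the unclipped log-prob
        lo, hi = 1, d_model
        while lo < hi:
            mid = (lo + hi) // 2
            if logprob_with_clipping(mid) >= full_logprob:
                hi = mid
            else:
                lo = mid + 1
        return lo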
2tailcalled
Given the large number of dimensions that are kept in each case, there must be considerable overlap in which dimensions they make use of. But how much? I concatenated the dimensions found in each of the prompts, and performed an SVD of it. It yielded this plot: ... unfortunately this seems close to the worst-case scenario. I had hoped for some split between general and task-specific dimensions, yet this seems like an extremely uniform mixture.
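The concatenation-and-SVD step, as a sketch (subspaces, holding each prompt's kept right singular vectors and their count, is a hypothetical container):

    # stack the kept directions from every prompt into one matrix and inspect its spectrum;
    # a sharp elbow would separate a shared "general" subspace from prompt-specific directions,
    # while a flat spectrum means the dimensions mix uniformly (the worst case described above)
    B = torch.cat([Vt_p[:k_p] for (Vt_p, k_p) in subspaces], dim=0)  # (sum of k_p, 4096)
    spectrum = torch.linalg.svdvals(B)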
2tailcalled
If I look at the pairwise overlap between the dimensions needed for each generation: ... then this is predictable down to ~1% error simply by assuming that they pick a random subset of the dimensions for each, so their overlap is proportional to each of their individual sizes.
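The random-subset prediction is just the product of the two subspace sizes over the total dimensionality. For example, for two of the subspace sizes quoted in the earlier comments (1886 and 3329 out of 4096), the predicted overlap - purely the prediction under that assumption, not a measured value - would be:

    def predicted_overlap(k_a, k_b, d_model=4096):
        # expected shared dimensionality if each prompt picked a random k-dimensional subspace
        return k_a * k_b / d_model

    print(predicted_overlap(1886, 3329))  # ~1533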
2tailcalled
Oops, my code had a bug so only self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and not self.feed_forward(self.ffn_norm(h)) was in the SVD. So the diagram isn't 100% accurate.

Thesis: while consciousness isn't literally epiphenomenal, it is approximately epiphenomenal. One way to think of this is that your output bandwidth is much lower than your input bandwidth. Another way to think of this is the prevalence of akrasia, where your conscious mind actually doesn't have full control over your behavior. On a practical level, the ecological reason for this is that it's easier to build a general mind and then use whatever parts of the mind are useful than to narrow down the mind to only work with a small slice of possibilities. This is quite analogous to how we probably use LLMs for a much narrower set of tasks than what they were trained for.

2Seth Herd
Consciousness is not at all epiphenomenal, it's just not the whole mind and not doing everything. We don't have full control over our behavior, but we have a lot. While the output bandwidth is low, it can be applied to the most important things.
2tailcalled
Maybe a point that was missing from my thesis is that one can have a higher-level psychological theory in terms of life-drives and death-drives which then addresses the important phenomenal activities but doesn't model everything. And then if one asks for an explanation of the unmodelled part, the answer will have to be consciousness. But then because the important phenomenal part is already modelled by the higher-level theory, the relevant theory of consciousness is ~epiphenomenal.
2Seth Herd
I guess I have no idea what you mean by "consciousness" in this context. I expect consciousness to be fully explained and still real. Ah, consciousness. I'm going to mostly save the topic for if we survive AGI and have plenty of spare time to clarify our terminology and work through all of the many meanings of the word. Edit - or of course if something else was meant by consciousness, I expect a full explanation to indicate that thing isn't real at all. I'm an eliminativist or a realist depending on exactly what is meant. People seem to be all over the place on what they mean by the word.
2tailcalled
A thermodynamic analogy might help: Reductionists like to describe all motion in terms of low-level physical dynamics, but that is extremely computationally intractable and arguably also misleading because it obscures entropy. Physicists avoid reductionism by instead factoring their models into macroscopic kinetics and microscopic thermodynamics. Reductionistically, heat is just microscopic motion, but microscopic motion that adds up to macroscopic motion has already been factored out into the macroscopic kinetics, so what remains is microscopic motion that doesn't act like macroscopic motion, either because it is ~epiphenomenal (heat in thermal equilibrium) or because it acts very differently from macroscopic motion (heat diffusion).

Similarly, reductionists like to describe all psychology in terms of low-level Bayesian decision theory, but that is extremely computationally intractable and arguably also misleading because it obscures entropy. You can avoid reductionism by instead factoring models into some sort of macroscopic psychology-ecology boundary and microscopic neuroses. Luckily Bayesian decision theory is pretty self-similar, so often the macroscopic psychology-ecology boundary fits pretty well with a coarse-grained Bayesian decision theory.

Now, similar to how most of the kinetic energy in a system in motion is usually in the microscopic thermal motion rather than in the macroscopic motion, most of the mental activity is usually in the microscopic neuroses rather than in the macroscopic psychology-ecology. Thus, whenever you think "consciousness", "self-awareness", "personality", "ideology", or any other broad and general psychological term, it's probably mostly about the microscopic neuroses. Meanwhile, similar to how tons of physical systems are very robust to wide ranges of temperatures, tons of psychology-ecologies are very robust to wide ranges of neuroses.

As for what "consciousness" really means, idk, currently I'm thinking it's tightly intertwin

Thesis: There are three distinct coherent notions of "soul": sideways, upwards, and downwards.

By "sideways souls", I basically mean what materialists would translate the notion of a soul to: the brain, or its structure, so something like that. By "upwards souls", I mean attempts to remove arbitrary/contingent factors from the sideways souls, for instance by equating the soul with one's genes or utility function. These are different in the particulars, but they seem conceptually similar and mainly differ in how they attempt to cut the question of identity (ide... (read more)

2Dagon
I'm having trouble following whether this categorizes the definition/concept of a soul, or the causality and content of this conception of soul.  Is "sideways soul" about structure and material implementation, or about weights and connectivity, independent of substrate?  WHICH factors are removed from upwards? ("genes" and "utility function" are VERY different dimensions, both tiny parts of what I expect create (for genes) or comprise (for utility function) a soul.)  What about memory?  Multiple levels of value and preferences (including meta-preferences in how to abstract into "values")? Putting "downwards" supernatural ideas into the same framework as more logical/materialist ideas confuses me - I can't tell if that makes it a more useful model or less.
4tailcalled
When you get into the particulars, there are multiple feasible notions of sideways soul, of which material implementation vs weights and connectivity are the main ones. I'm most sympathetic to weights and connectivity. I have thought less about and seen less discussion about upwards souls. I just mentioned it because I'd seen a brief reference to it once, but I don't know anything in-depth. I agree that both genes and utility function seem incomplete for humans, though for utility maximizers in general I think there is some merit to the soul == utility function view. Memory would usually go in sideways soul, I think. idk Sideways vs upwards vs downwards is more meant to be a contrast between three qualitatively distinct classes of frameworks than it is meant to be a shared framework.
2Seth Herd
Excellent! I like the move of calling this "soul" with no reference to metaphysical souls. This is highly relevant to discussions of "free will" if the real topic is self-determination - which it usually is.
2tailcalled
"Downwards souls are similar to the supernatural notion of souls" is an explicit reference to metaphysical souls, no?
2Seth Herd
um, it claims to be :) I don't think that's got much relationship to the common supernatural notion of souls. But I read it yesterday and forgot that you'd made that reference.
2tailcalled
What special characteristics do you associate with the common supernatural notion of souls which differs from what I described?
2Nathan Helm-Burger
The word 'soul' is so tied in my mind to implausible metaphysical mythologies that I'd parse this better if the word were switched for something like 'quintessence' or 'essential self' or 'distinguishing uniqueness'.
2tailcalled
What implausible metaphysical mythologies is it tied up with? As mentioned in my comment, downwards souls seem to satisfy multiple characteristics we'd associate with mythological souls, so this and other things makes me wonder if the metaphysical mythologies might actually be more plausible than you realize.

Thesis: in addition to probabilities, forecasts should include entropies (how many different conditions are included in the forecast) and temperatures (how intense is the outcome addressed by the marginal constraint in this forecast, i.e. the big-if-true factor).

I say "in addition to" rather than "instead of" because you can't compute probabilities just from these two numbers. If we assume a Gibbs distribution, there's the free parameter of energy: ln(P) = S - E/T. But I'm not sure whether this energy parameter has any sensible meaning with more general ev... (read more)


Thesis: whether or not tradition contains some moral insights, commonly-told biblical stories tend to be too sparse to be informative. For instance, there's no plot-relevant reason why it should be bad for Adam and Eve to have knowledge of good and evil. Maybe there's some interpretation of good and evil where it makes sense, but it seems like then that interpretation should have been embedded more properly in the story.

3UnderTruth
It is worth noting that, in the religious tradition from which the story originates, it is Moses who commits these previously-oral stories to writing, and does so in the context of a continued oral tradition which is intended to exist in parallel with the writings. On their own, the writings are not meant to be complete, both in order to limit more advanced teachings to those deemed ready for them, as well as to provide occasion to seek out the deeper meanings, for those with the right sort of character to do so.
4tailcalled
This makes sense. The context I'm thinking of is my own life, where I come from a secular society with atheist parents, and merely had brief introductions to the stories from bible reading with parents and Christian education in school. (Denmark is a weird society - few people are actually Christian or religious, so it's basically secular, but legally speaking we are Christian and do not have separation between Church and state, so there are random fragments of Christianity we run into.)
2lemonhope
What? Nobody told me. Where did you learn this
4Garrett Baker
This is the justification behind the Talmud.

Thesis: one of the biggest alignment obstacles is that we often think of the utility function as being basically-local, e.g. that each region has a goodness score and we're summing the goodness over all the regions. This basically-guarantees that there is an optimal pattern for a local region, and thus that the global optimum is just a tiling of that local optimal pattern.

Even if one adds a preference for variation, this likely just means that a distribution of patterns is optimal, and the global optimum will be a tiling of samples from said distribution.

T... (read more)
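A tiny illustration of the first paragraph's point (the regions, patterns, and scores here are all made up): if utility is a sum of independent per-region scores, the global optimum is just the best local pattern repeated everywhere.

    from itertools import product

    local_score = {"A": 1.0, "B": 3.0, "C": 2.0, "D": 0.5}  # made-up score per local pattern

    def U(x):  # x assigns one pattern to each of 3 regions; utility sums the local scores
        return sum(local_score[p] for p in x)

    best = max(product(local_score, repeat=3), key=U)
    print(best)  # ('B', 'B', 'B') -- the locally optimal pattern, tiled across all regions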

Current agent models like argmax entirely lack any notion of "energy". Not only does this seem kind of silly on its own, I think it also leads to missing important dynamics related to temperature.

I think I've got it, the fix to the problem in my corrigibility thing!

So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don't do this.) That is, if we say that the AI's utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only wa... (read more)

2tailcalled
It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can't immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.
2tailcalled
Like suppose the AI immediately very publicly does something that looks very unsafe. Say grabs control over the stop button and starts mass-producing paperclips in an extremely publicly visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the |S conditional would lead to people quickly wanting to stop it. This means that in the |S branch, it can quickly determine whether it is in the f|S branch or the s|S branch; in the f|S case, it can then keep going with whatever optimization V specified, while in the s|S case, it can then immediately shut itself down. But the reason I think the AI *wouldn't* do this is, what about the |F branch? If you condition on humans not wanting to press the stop button even though there's a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with are stuff like "black hole swallows the earth", but this would rank pretty low in the AI's utility function, and therefore it would avoid acting this way in order to have a reasonable |F branch. But this does not seem like sane reasoning on the AI's side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.

I was surprised to see this on twitter:

I mean, I'm pretty sure I knew what caused it (this thread or this market), and I guess I knew from Zack's stuff that rationalist cultism had gotten pretty far, but I still hadn't expected that something this small would lead to being blocked.

FYI: I have a low bar for blocking people who have according-to-me bad, overconfident, takes about probability theory, in particular. For whatever reason, I find people making claims about that topic, in particular, really frustrating. ¯\_(ツ)_/¯

The block isn't meant as a punishment, just a "I get to curate my online experience however I want."

2tailcalled
I think blocks are pretty irrelevant unless one conditions on the particular details of the situation. In this case I think the messages I was sharing are very important. If you think my messages are instead unimportant or outright wrong, then I understand why you would find the block less interesting, but in that case I don't think we can meaningfully discuss it without knowing why you disagree with the messages.

I'm not particularly interested in discussing it in depth. I'm more like giving you a data-point in favor of not taking the block personally, or particularly reading into it. 

(But yeah, "I think these messages are very important" is likely to trigger my personal "bad, overconfident takes about probability theory" neurosis.)

This is awkwardly armchair, but… my impression of Eliezer includes him being just so tired, both specifically from having sacrificed his present energy in the past while pushing to rectify the path of AI development (by his own model thereof, of course!) and maybe for broader zeitgeist reasons that are hard for me to describe. As a result, I expect him to have entered into the natural pattern of having a very low threshold for handing out blocks on Twitter, both because he's beset by a large amount of sneering and crankage in his particular position and because the platform easily becomes a sinkhole in cognitive/experiential ways that are hard for me to describe but are greatly intertwined with the aforementioned zeitgeist tiredness.

Something like: when people run heavily out of certain kinds of slack for dealing with The Other, they reach a kind of contextual-but-bleed-prone scarcity-based closed-mindedness of necessity, something that both looks and can become “cultish” but where reaching for that adjective first is misleading about the structure around it. I haven't succeeded in extracting a more legible model of this, and I bet my perception is still skew to the reality, but I'... (read more)

I disagree with the sibling thread about this kind of post being “low cost”, BTW; I think adding salience to “who blocked whom” types of considerations can be subtly very costly.

 

I agree publicizing blocks has costs, but so does a strong advocate of something with a pattern of blocking critics. People publicly announcing "Bob blocked me" is often the only way to find out if Bob has such a pattern. 

I do think it was ridiculous to call this cultish. Tuning out critics can be evidence of several kinds of problems, but not particularly that one. 

5tailcalled
I agree that it is ridiculous to call this cultish if this was the only evidence, but we've got other lines of evidence pointing towards cultishness, so I'm making a claim of attribution more so than a claim of evidence.
3M. Y. Zuo
Blocking a lot isn’t necessarily bad or unproductive… but in this case it’s practically certain blocking thousands will eventually lead to blocking someone genuinely more correct/competent/intelligent/experienced/etc… than himself, due to sheer probability. (Since even a ‘sneering’ crank is far from literal random noise.) Which wouldn’t matter at all for someone just messing around for fun, who can just treat X as a text-heavy entertainment system. But it does matter somewhat for anyone trying to do something meaningful and/or accomplish certain goals. In short, blocking does have some, variable, credibility cost. Ranging from near zero to quite a lot, depending on who the blockee is.
2tailcalled
Eliezer Yudkowsky being tired isn't an unrelated accident though. Bayesian decision theory in general intrinsically causes fatigue by relying on people to use their own actions to move outcomes instead of getting leverage from destiny/higher powers, which matches what you say about him having sacrificed his present energy for this. Similarly, "being Twitterized" is just about stewing in garbage and cursed information, such that one is forced to filter extremely aggressively, but blocking high-quality information sources accelerates the Twitterization by changing the ratio of blessed to garbage/cursed information. On the contrary, I think raising salience of such discussions helps clear up the "informational food chain", allowing us to map out where there are underused opportunities and toxic accumulation.
6Richard_Kennaway
It seems likely to me that Eliezer blocked you because he has concluded that you are a low-quality information source, no longer worth the effort of engaging with.
4tailcalled
I agree that this is likely Eliezer's mental state. I think this belief is false, but for someone who thinks it's true, there's of course no problem here.
6Richard_Kennaway
Please say more about this. Where can I get some?
6tailcalled
Working on writing stuff but it's not developed enough yet. To begin with you can read my Linear Diffusion of Sparse Lognormals sequence, but it's not really oriented towards practical applications.
2Richard_Kennaway
I will look forward to that. I have read the LDSL posts, but I cannot say that I understand them, or guess what the connection might be with destiny and higher powers.
2tailcalled
One of the big open questions that the LDSL sequence hasn't addressed yet is: what starts all the lognormals, and why are they so commensurate with each other? So far, the best answer I've been able to come up with is a thermodynamic approach (hence my various recent comments about thermodynamics). The lognormals all originate as emanations from the sun, which is obviously a higher power. They then split up and recombine in various complicated ways.

As for destiny: The sun throws in a lot of free energy, which can be developed in various ways, increasing entropy along the way. But some developments don't work very well, e.g. self-sabotaging (fire), degenerating (parasitism leading to capabilities becoming vestigial), or otherwise getting "stuck". But it's not all developments that get stuck; some developments lead to continuous progress (sunlight -> cells -> eukaryotes -> animals -> mammals -> humans -> society -> capitalism -> ?). This continuous progress is not just accidental, but rather an intrinsic part of the possibility landscape. For instance, eyes have evolved in parallel to very similar structures, and even modern cameras have a lot in common with eyes. There's basically some developments that intrinsically unblock lots of derived developments while preferentially unblocking developments that defend themselves over developments that sabotage themselves. Thus as entropy increases, such developments will intrinsically be favored by the universe. That's destiny.

Critically, getting people to change many small behaviors in accordance with long explanations contradicts destiny, because it is all about homogenizing things and adding additional constraints, whereas destiny is all about differentiating things and releasing constraints.
7quetzal_rainbow
Meta-point: your communication fits the following pattern: The reason why smart people find themselves in this pattern is that they expect short inferential distances, i.e., they see their argumentation not as vague esoteric crackpottery but as a set of very clear statements, and fail to put themselves in the shoes of the people who are going to read it, and they especially fail to account for the fact that readers already distrust them because they started the conversation with <controversial statement>. On the object level, as stated, you are wrong. Observing a heuristic failing should decrease your confidence in the heuristic. You can argue that your update should be small, due to, say, measurement errors or strong priors, but the direction of the update should be strictly down.
2tailcalled
Can you fill in a particular example of me engaging in that pattern so we can address it in the concrete rather than in the abstract?
7quetzal_rainbow
To be clear, I mean "your communication in this particular thread". Pattern:

<controversial statement>
<this statement is false>
<controversial statement>
<this statement is false>
<mix of "this is trivially true because" and "here is my blogpost with esoteric terminology">

The following responses from EY are more in the genre of "I ain't reading this", because he is using you more as an example for other readers than talking directly to you, followed by the block.
3tailcalled
This statement had two parts. Part 1: And part 2: Part 2 is what Eliezer said was false, but it's not really central to my point (hence why I didn't write much about it in the original thread), and so it is self-sabotaging of Eliezer to zoom into this rather than the actually informative point.
5MondSemmel
People should feel free to liberally block one another on social media. Being blocked is not enough to warrant an accusation of cultism.
8tailcalled
I did not say that simply blocking me warrants an accusation of cultism. I highlighted the fact that I had been blocked and the context in which it occurred, and then brought up other angles which evidenced cultism. If you think my views are pathetic and aren't the least bit alarmed by them being blocked, then feel free to feel that way, but I suspect there are at least some people here who'd like to keep track of how the rationalist isolation is progressing and who see merit in my positions.
1MondSemmel
Again, people block one another on social media for any number of reasons. That just doesn't warrant feeling alarmed or like your views are pathetic.
4tailcalled
We know what the root cause is, you don't have to act like it's totally mysterious. So the question is, was this root cause (pushback against Eliezer's Bayesianism):

* An important insight that Eliezer was missing (alarming!)
* Worthless pedantry that he might as well block (nbd/pathetic)
* Antisocial trolling that ought to be gotten rid of (reassuring that he blocked)
* ... or something else

Regardless of which of these is the true one, it seems informative to highlight for anyone who is keeping track of what is happening around me. And if the first one is the true one, it seems like people who are keeping track of what is happening around Eliezer would also want to know it. Especially since it only takes a very brief moment to post and link about getting blocked. Low cost action, potentially high reward.

MIRI full-time employed many critics of bayesianism for 5+ years and MIRI researchers themselves argued most of the points you made in these arguments. It is obviously not the case that critiquing bayesianism is the reason why you got blocked.

5tailcalled
Idk, maybe you've got a point, but Eliezer was very quick to insist that what I said was not the mainstream view and disengage. And MIRI was full of internal distrust. I don't know enough of the situation to know if this explains it, but it seems plausible to me that the way MIRI kept stuff together was by insisting on a Bayesian approach, and that some generators of internal dissent were people whose intuition aligned more with a non-Bayesian approach. For that matter, an important split in rationalism is MIRI/CFAR vs the Vassarites, and while I wouldn't really say the Vassarites formed a major inspiration for LDSL, after coming up with LDSL I've totally reevaluated my interpretation of that conflict as being about MIRI/CFAR using a Bayesian approach and the Vassarites using an LDSL approach. (Not absolutely of course, everyone has a mixture of both, but in terms of relative differences.)

I've been thinking about how the way to talk about how a neural network works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck because of the issue where you can add new components by subtracting off large irrelevant components.

I've also been thinking about deception and its relationship to "natural abstractions", and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than... (read more)

4Thomas Kwa
Much dumber ideas have turned into excellent papers
2tailcalled
True, though I think the Hessian is problematic enough that I'd either want to wait until I have something better, or want to use a simpler method. It might be worth going into more detail about that. The Hessian for the probability of a neural network output is mostly determined by the Jacobian of the network. But in some cases the Jacobian gives us exactly the opposite of what we want. If we consider the toy model of a neural network with no input neurons and only 1 output neuron, g(w) = ∏_i w_i (which I imagine to represent a path through the network, i.e. a bunch of weights get multiplied along the layers to the end), then the Jacobian is the gradient: (J_g(w))_j = (∇g(w))_j = ∏_{i≠j} w_i = (∏_i w_i)/w_j. If we ignore the overall magnitude of this vector and just consider how the contribution that it assigns to each weight varies over the weights, then we get (J_g(w))_j ∝ 1/w_j. Yet for this toy model, "obviously" the contribution of weight j "should" be proportional to w_j. So derivative-based methods seem to give the absolutely worst-possible answer in this case, which makes me pessimistic about their ability to meaningfully separate the actual mechanisms of the network (again they may very well work for other things, such as finding ways of changing the network "on the margin" to be nicer).
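A quick numerical check of the toy model with autograd:

    import torch

    w = torch.tensor([0.5, 1.0, 2.0, 4.0], requires_grad=True)
    g = w.prod()       # g(w) = 0.5 * 1.0 * 2.0 * 4.0 = 4.0
    g.backward()
    print(w.grad)                  # tensor([8., 4., 2., 1.]), i.e. g / w_j, proportional to 1/w_j
    print(g.item() / w.detach())   # same values, confirming (∇g(w))_j = (∏_i w_i) / w_j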

One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.

When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do... (read more)

Thesis: money = negative entropy, wealth = heat/bound energy, prices = coldness/inverse temperature, Baumol effect = heat diffusion, arbitrage opportunity = free energy.

2tailcalled
Maybe this mainly works because the economy is intelligence-constrained (since intelligence works by pulling off negentropy from free energy), and it will break down shortly after human-level AGI?

Thesis: there's a condition/trauma that arises from having spent a lot of time in an environment where there's excess resources for no reasons, which can lead to several outcomes:

  • Inertial drifting in the direction implied by one's prior adaptations,
  • Conformity/adaptation to social popularity contests based on the urges above,
  • Getting lost in meta-level preparations,
  • Acting as a stickler for the authorities,
  • "Bite the hand that feeds you",
  • Tracking the resource/motivation flows present.

By contrast, if resources are contingent on a particular reason, everything takes shape according to said reason, and so one cannot make a general characterization of the outcomes.

1Mateusz Bagiński
It's not clear to me how this results from "excess resources for no reasons". I guess the "for no reasons" part is crucial here?

Thesis: the median entity in any large group never matters and therefore the median voter doesn't matter and therefore the median voter theorem proves that democracies get obsessed about stuff that doesn't matter.

2Dagon
A lot depends on your definition of "matter".  Interesting and important debates are always on margins of disagreement.  The median member likely has a TON of important beliefs and activities that are uncontroversial and ignored for most things.  Those things matter, and they matter more than 95% of what gets debated and focused on.   The question isn't whether the entities matter, but whether the highlighted, debated topics matter.

I recently wrote a post about myopia, and one thing I found difficult when writing the post was in really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.

But now I've been thinking about it further, and I think I've realized - don't we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we're already being myopic in some ways, e.g. when train... (read more)

2Charlie Steiner
Yeah, I think usually when people are interested in myopia, it's because they think there's some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers. I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problem statement is true - if there's a simpler myopic solution that's bad, then myopia won't help (so how can we predict if this is true?) and if there's a simpler non-myopic solution that's good, myopia may actively hurt (this one seems a little easier to predict though).

Thesis: a general-purpose interpretability method for utility-maximizing adversarial search is a sufficient and feasible solution to the alignment problem. Simple games like chess have sufficient features/complexity to work as a toy model for developing this, as long as you don't rely overly much on preexisting human interpretations for the game, but instead build the interpretability from the ground-up.

The universe has many conserved and approximately-conserved quantities, yet among them energy feels "special" to me. Some speculations why:

  • The sun bombards the earth with a steady stream of free energy, which then radiates back out into the night sky.
  • Time-evolution is determined by a 90-degree rotation of energy (Schrodinger equation/Hamiltonian mechanics).
  • Breaking a system down into smaller components primarily requires energy.
  • While aspects of thermodynamics could apply to many conserved quantities, we usually apply it to energy only, and it was first discovered in the c
... (read more)
8jacob_drori
Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most of these quantities aren't transferred between objects at the everyday, macro level we humans are used to. E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out), so conservation of charge isn't very useful in day-to-day life. E.g. 2: conservation of color charge doesn't really say anything useful about everyday processes, since it's only changed by subatomic processes (this is again basically due to the screening effect of particles with negative color charge, though the story here is much more subtle, since the main screening effect is due to virtual particles rather than real ones). The only other fundamental conserved quantity I can think of that is nontrivially exchanged between objects at the macro level is momentum. And... momentum seems roughly as important as energy? I guess there is a question about why energy, rather than momentum, appears in thermodynamics. If you're interested, I can answer in a separate comment.
2tailcalled
At a human level, the count of each type of atom is basically always conserved too, so it's not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc. I guess for momentum in particular, it seems reasonable why it wouldn't be useful in a thermodynamics-style model, because things would whoosh away too much (unless you're dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but I would actually more wonder: could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
1jacob_drori
> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations? Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ. In ordinary quantum mechanics, time and space are treated very differently: t is a coordinate whereas x is a dynamical variable (which happens to be operator-valued). The equations of QM tell us how x evolves as a function of t. But ordinary QM was long-ago replaced by quantum field theory, in which time and space are on a much more even footing: they are both coordinates, and the equations of QFT tell us how a third thing (the field ϕ(x,t)) evolves as a function of x and t. Now, the only difference between time and space is that there is only one dimension of the former but three of the latter (there may be some other very subtle differences I'm glossing over here, but I wouldn't be surprised if they ultimately stem from this one). All of this is to say: our best theory of how nature works (QFT), is neither formulated as "energy-first" nor as "momentum-first". Instead, energy and momentum are on fairly equal footing.
2tailcalled
I suppose that's true, but this kind of confirms my intuition that there's something funky going on here that isn't accounted for by rationalist-empiricist-reductionism. Like why are time translations so much more important for our general work than space translations? I guess because the sun bombards the earth with a steady stream of free energy, and earth has life which continuously uses this sunlight to stay out of equilibrium. In a lifeless solar system, time-translations just let everything spin, which isn't that different from space-translations.
1jacob_drori
Ah, so I think you're saying "You've explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?" This is a great question, and as with any question of the form "why does this property emerge from these basic rules", there's unlikely to be a short answer. E.g. if you said "given our understanding of the standard model, explain how a cell works", I'd have to reply "uhh, get out a pen and paper and get ready to churn through equations for several decades". In this case, one might be able to point to a few key points that tell the rough story. You'd want to look at properties of solutions of PDEs on manifolds with metric of signature (1,3) (which means "one direction on the manifold is different to the other three, in that it carries a minus sign in the metric compared to the others"). I imagine that, generically, these solutions behave differently with respect to the "1" direction and the "3" directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can't be more specific!
2tailcalled
Why assume a reductionistic explanation, rather than a macroscopic explanation? Like for instance the second law of thermodynamics is well-explained by the past hypothesis but not at all explained by churning through mechanistic equations. This seems in some ways to have a similar vibe to the second law.
1[comment deleted]
3Noosphere89
The best answer to the question is that it serves as essentially a universal resource that can be used to provide a measuring stick. It does this by being a resource that is limited, fungible, always better to have more of than less of, and additive across decisions: You have a limited amount of joules of energy/negentropy, but you can spend it on essentially arbitrary goods for your utility, and it is essentially a more physical and usable form of money in an economy. Also, more energy is always a positive thing, so you are never worse off by having more energy, and energy is linear in the sense that if I've spent 10 joules on computation, and spent another 10 joules on computation 1 minute later, I've spent 20 joules in total. Cf this post on the measuring stick of utility problem: https://www.lesswrong.com/posts/73pTioGZKNcfQmvGF/the-measuring-stick-of-utility-problem
-1tailcalled
Agree that free energy in many ways seems like a good resource to use as a measuring stick. But matter is too available and takes too much energy to make, so you can't spend it on matter in practice. So it's non-obvious why we wouldn't have a matter-thermodynamics as well as an energy-thermodynamics. I guess especially with oxygen, since it is so reactive. I guess one limitation with considering a system where oxygen serves an analogous role to sunlight (beyond such systems being intrinsically rare) is that as the oxygen reacts, it takes up elements, and so you cannot have the "used-up" oxygen leave the system again without diminishing the system. Whereas you can have photons leave again. Maybe this is just the fungibility property again, which to some extent seems like the inverse of the "breaking a system down into smaller components primarily requires energy" property (though your statements of fungibility is more general because it also considers kinetic energy).
2tailcalled
Thinking further, a key part of it is that temperature has a tendency to mix stuff together, due to the associated microscopic kinetic energy.

Thesis: the problem with LLM interpretability is that LLMs cannot do very much, so for almost all purposes "prompt X => outcome Y" is all the interpretation we can get.

Counterthesis: LLMs are fiddly and usually it would be nice to understand what ways one can change prompts to improve their effectiveness.

Synthesis: LLM interpretability needs to start with some application (e.g. customer support chatbot) to extend the external subject matter that actually drives the effectiveness of the LLM into the study.

Problem: this seems difficult to access, and the people who have access to it are busy doing their job.

1sunwillrise
I'm very confused. Can we not do LLM interpretability to try to figure out whether or where superposition holds? Is it not useful to see how SAEs help us identify and intervene on specific internal representations that LLMs generate for real-world concepts? As an outsider to interpretability, it has long been my (rough) understanding that most of the useful work in interpretability deals precisely with attempts to figure out what is going on inside the model rather than how it responds to outside prompts. So I don't know what the thesis statement refers to...
2tailcalled
I guess to clarify: Everything has an insanely large amount of information. To interpret something, we need to be able to see what "energy" (definitely literal energy, but likely also metaphorical energy) that information relates to, as the energy is more bounded and unified than the information. But that's (the thesis goes) hard for LLMs.
2tailcalled
Not really, because this requires some notion of the same vs distinct features, which is not so interesting when the use of LLMs is so brief. I don't think so since you've often got more direct ways of intervening (e.g. applying gradient updates).
1sunwillrise
I'm sorry, but I still don't really understand what you mean here. The phrase "the use of LLMs is so brief" is ambiguous to me. Do you mean to say:

* a new, better LLM will come out soon anyway, making your work on current LLMs obsolete?
* LLM context windows are really small, so you "use" them only for a brief time?
* the entire LLM paradigm will be replaced by something else soon?
* something totally different from all of the above?

But isn't this rather... prosaic and "mundane"?  I thought the idea behind these methods that I have linked was to serve as the building blocks for future work on ontology identification and ultimately getting a clearer picture of what is going on internally, which is a crucial part of stuff like Wentworth's "Retarget the Search" and other research directions like it.  So the fact that SAE-based updates of the model do not currently result in more impressive outputs than basic fine-tuning does not matter as much compared to the fact that they work at all, which gives us reason to believe that we might be able to scale them up to useful, strong-interpretability levels. Or at the very least that the insights we get from them could help in future efforts to obtain this.

Kind of like how you can teach a dog to sit pretty well just by basic reinforcement, but if you actually had a gears-level understanding of how its brain worked, down to the minute details, and the ability to directly modify the circuits in its mind that represented the concept of "sitting", then you would be able to do this much more quickly, efficiently, and robustly. Am I totally off-base here?
2tailcalled
Maybe it helps if I start by giving some different applications one might want to use artificial agency for:

As a map: We might want to use the LLM as a map of the world, for instance by prompting us with data from the world and having it assist us with navigating that data. Now, the purpose of a map is to reflect as little information as possible about the world while still providing the bare minimum backbone needed to navigate the world. This doesn't work well with LLMs because they are instead trained to model information, so they will carry as much information as possible, and any map-making they do will be an accident driven by mimicking the information it's seen of mapmakers, rather than primarily an attempt to eliminate information about the world.

As a controller: We might want to use the LLM to perform small pushes to a chaotic system at times when the system reaches bifurcations where its state is extremely sensitive, such that the system moves in a desirable direction. But again I think LLMs are so busy copying information around that they don't notice such sensitivities except by accident.

As a coder: Since LLMs are so busy outputting information instead of manipulating "energy", maybe we could hope that they could assemble a big pile of information that we could "energize" in a relevant way, e.g. if they could write a large codebase and we could then execute it on a CPU and have a program that does something interesting in the world. But in order for this to work, the program shouldn't have obstacles that stop the "energy" dead in its tracks (e.g. bugs that cause it to crash). But again the LLM isn't optimizing for doing that, it's just trying to copy information around that looks like software, and it only makes space for the energy of the CPU and the program functionality as a side-effect of that. (Or as the old saying goes, it's maximizing lines of code written, not minimizing lines of code used.)

So, that gives us the thesis: To interpret the