silentbob's Shortform

silentbob

silentbob's Shortform — LessWrong

25th Jun 2024

This is a special post for quick takes by silentbob. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Mentioned in: We’re not as 3-Dimensional as We Think

[-]silentbob2y440

One crucial question in understanding and predicting the learning process, and ultimately the behavior, of modern neural networks, is that of the shape of their loss landscapes. What does this extremely high dimensional landscape look like? Does training generally tend to find minima? Do minima even exist? Is it predictable what type of minima (or regions of lower loss) are found during training? What role does initial randomization play? Are there specific types of basins in the landscape that are qualitatively different from others, that we might care about for safety reasons?

First, let’s just briefly think about very high dimensional spaces. One somewhat obvious observation is that they are absolutely vast. With each added dimension, the volume of the available space increases exponentially. Intuitively we tend to think of 3-dimensional spaces, and often apply this visual/spatial intuition to our understanding of loss landscapes. But this can be extremely misleading. Parameter spaces are vast to a degree that our brains can hardly fathom. Take GPT3 for instance. It has 175 billion parameters, or dimensions. Let’s assume somewhat arbitrarily that all parameters end up in a range of [-0.5, 0.5], i.e. live in a 175-billion-dimensional unit cube around the origin of that space (this is not actually the case, and the real parameter space is much, much larger, but bear with me). Every single axis only spans a length of 1 (let’s just for the sake of it interpret this as “1 meter”), yet the diagonal from one corner of this cube to the opposite one, which has length √n in n dimensions, comes out at ~420km. So if, hypothetically, you were sitting in the middle of this high dimensional unit cube, you could easily touch every single wall with your hand. But nonetheless, all the corners would be more than 200km distant from you.
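The distances above follow from the Pythagorean theorem: the main diagonal of an n-dimensional cube with side length s is s·√n. A minimal sketch to reproduce the numbers (the function and variable names are made up for illustration):

```javascript
// Length of the main diagonal of an n-dimensional cube with the given side length.
// By the Pythagorean theorem this is side * sqrt(n).
function cubeDiagonal(dimensions, side = 1) {
  return side * Math.sqrt(dimensions);
}

const gpt3Dimensions = 175e9; // parameter count quoted in the text
const diagonalMeters = cubeDiagonal(gpt3Dimensions); // side interpreted as 1 meter
const centerToCornerMeters = diagonalMeters / 2; // the center sits halfway along the diagonal
const centerToWallMeters = 0.5; // distance from the center to any face of the unit cube

console.log(`diagonal: ${(diagonalMeters / 1000).toFixed(0)} km`);
console.log(`center to corner: ${(centerToCornerMeters / 1000).toFixed(0)} km`);
console.log(`center to wall: ${centerToWallMeters} m`);
```

So the diagonal comes out at roughly 418 km and the center-to-corner distance at roughly 209 km, while every wall stays half a meter away.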

This may be mind boggling, but is it relevant? I think it is. Take this realization for instance: if you have two minima in this high dimensional space, but one is just a tiny bit “flatter” than the other (meaning the second derivatives overall are a bit closer to 0), then the attractor basin of this flatter minimum is vastly larger than that of the other minimum. This is because the flatness implies a larger radius, and the volume scales with that radius raised to the power of the number of dimensions. So, at 175 billion dimensions, even a microscopically larger radius means an overwhelmingly larger volume. If, for instance, one minimum’s attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my Javascript code to calculate this is accurate enough, that is). And this is only for GPT3, which is almost 4 years old by now.

The parameter space is just ridiculously large, so it becomes really crucial how the search process through it works and where it lands. It may be that somewhere in this vast space, there are indeed attractor basins that correspond to minima that we find extremely undesirable – certain capable optimizers perhaps, that have situational awareness and deceptive tendencies. If they do exist, what could we possibly tell about them? Maybe these minima have huge attractor basins that are reliably found eventually (maybe once we switch to a different network architecture, or find some adjustment to gradient descent, or reach a certain model size, or whatever), which would of course be bad news. Or maybe these attractor basins are so vanishingly small that we basically don’t have to care about them at all, because all the computer & search capacity of humanity over the next million years would have an almost 0 chance of ever stumbling onto these regions. Maybe they are even so small that they are numerically unstable, and even if your search process through some incredible cosmic coincidence happens to start right in such a basin, the first SGD step would immediately jump out of it due to the limitations of numerical accuracy on the hardware we’re using.

So, what can we actually tell at this point about the nature of high dimensional loss landscapes? While reading up on this topic, one thing that constantly came up is the fact that, the more dimensions you have, the lower the relative number of minima becomes compared to saddle points. Meaning that whenever the training process appears to slow down and it looks like it found some local minimum, it’s actually overwhelmingly likely that what it actually found is a saddle point, hence the training process never halts but keeps moving through parameter space, even if the loss doesn't change that much. Do local minima exist at all? I guess it depends on the function the neural network is learning to approximate. Maybe some loss landscapes exist where the loss can just get asymptotically closer to some minimum (such as 0), without ever reaching it. And probably other loss landscapes exist where you actually have a global minimum, as well as several local ones.

Some people argue that you probably have no minima at all, because with each added dimension it becomes less and less likely that a given point is a minimum (because not only does the gradient at that point have to be 0 for it to be a minimum, the second derivatives also need to be in on it, with positive curvature in every single direction). This sounds compelling, but given that the space itself also grows exponentially with each dimension, we also have overwhelmingly more points to choose from. If you e.g. look at n-dimensional Perlin Noise, its absolute number of local minima within an n-dimensional cube of constant side length actually increases with each added dimension. However, the relative number of local minima compared to the available space still decreases, so it becomes harder and harder to find them.
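The "more minima in absolute terms, fewer in relative terms" pattern can be illustrated with a much cruder stand-in than Perlin noise. The sketch below is an assumption-laden toy: it uses independent random values on a periodic grid (not a smooth function, and not Perlin noise) and counts grid points that are lower than all of their 2n axis neighbors. For such noise, the expected fraction of local minima is 1/(2n+1), while the number of grid points grows as side^n. All names are hypothetical.

```javascript
// Small seeded pseudo-random generator (mulberry32) so that runs are reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Count strict local minima of i.i.d. random values on a periodic grid
// with `side` points per axis in `dims` dimensions.
function countLocalMinima(dims, side, seed) {
  const rand = mulberry32(seed);
  const total = side ** dims;
  const values = new Float64Array(total);
  for (let i = 0; i < total; i++) values[i] = rand();

  let minima = 0;
  for (let i = 0; i < total; i++) {
    let isMinimum = true;
    let stride = 1; // index distance between neighbors along the current axis
    for (let d = 0; d < dims && isMinimum; d++) {
      const coord = Math.floor(i / stride) % side;
      const up = i + (((coord + 1) % side) - coord) * stride;
      const down = i + (((coord + side - 1) % side) - coord) * stride;
      if (values[up] <= values[i] || values[down] <= values[i]) isMinimum = false;
      stride *= side;
    }
    if (isMinimum) minima++;
  }
  return { dims, total, minima, fraction: minima / total };
}

const results = [1, 2, 3, 4, 5].map((dims) => countLocalMinima(dims, 10, 42));
for (const r of results) {
  console.log(`${r.dims}D: ${r.minima} minima among ${r.total} points (${(100 * r.fraction).toFixed(1)} %)`);
}
```

This says nothing about real loss landscapes, which are far from independent noise; it only shows that a shrinking fraction of minima and a growing absolute count are compatible.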

I’ll keep it at that. This is already not much of a "quick" take. Basically, more research is needed, as my literature review on this subject yielded way more questions than answers, and many of the claims people made in their blog posts, articles and sometimes even papers seemed to be more intuitive / common-sensical or generalized from maybe-not-that-easy-to-validly-generalize-from research.

One thing I’m sure about however is that almost any explanation of how (stochastic) gradient descent works, that uses 3D landscapes for intuitive visualizations, is misleading in many ways. Maybe it is the best we have, but imho all such explainers should come with huge asterisks, explaining that the rules in very high dimensional spaces may look much different than our naive “oh look at that nice valley over there, let’s walk down to its minimum!” understanding, that happens to work well in three dimensions.

[-]Jesse Hoogland2y2313

I'd like to point out that for neural networks, isolated critical points (whether minima, maxima, or saddle points) basically do not exist. Instead, it's valleys and ridges all the way down. So the word "basin" (which suggests the geometry is parabolic) is misleading.

Because critical points are non-isolated, there are more important kinds of "flatness" than having small second derivatives. Neural networks have degenerate loss landscapes: their Hessians have zero-valued eigenvalues, which means there are directions you can walk along that don't change the loss (or that change the loss by a cubic or higher power rather than a quadratic power). The dominant contribution to how volume scales in the loss landscape comes from the behavior of the loss in those degenerate directions. This is much more significant than the behavior of the quadratic directions. The amount of degeneracy is quantified by singular learning theory's local learning coefficient (LLC).
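A toy example of such a degenerate direction, as a minimal sketch: the two-parameter loss L(a, b) = (a·b − 1)² is an illustrative assumption here, not a model from singular learning theory or from this comment. Every point with a·b = 1 has zero loss, so the minima form a curve, and the Hessian at any of them has one zero eigenvalue (along the valley) and one positive eigenvalue (across it).

```javascript
// Toy loss with two parameters: L(a, b) = (a * b - 1)^2.
function loss(a, b) {
  return (a * b - 1) ** 2;
}

// Analytic Hessian of the toy loss.
function hessian(a, b) {
  const daa = 2 * b * b;
  const dbb = 2 * a * a;
  const dab = 4 * a * b - 2;
  return [
    [daa, dab],
    [dab, dbb],
  ];
}

// Eigenvalues of a symmetric 2x2 matrix, smallest first.
function eigenvalues2x2(m) {
  const mean = (m[0][0] + m[1][1]) / 2;
  const radius = Math.hypot((m[0][0] - m[1][1]) / 2, m[0][1]);
  return [mean - radius, mean + radius];
}

// Two different points on the valley a * b = 1.
for (const [a, b] of [[1, 1], [2, 0.5]]) {
  const [flat, curved] = eigenvalues2x2(hessian(a, b));
  console.log(`(${a}, ${b}): loss=${loss(a, b)}, eigenvalues=${flat.toFixed(3)}, ${curved.toFixed(3)}`);
}
```

The near-zero eigenvalue is the direction you can walk along without changing the loss; counting only "small second derivatives" would miss that this direction is exactly flat rather than merely shallow.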

In the Bayesian setting, the relationship between geometric degeneracy and inductive biases is well understood through Watanabe's free energy formula. There's an inductive bias towards more degenerate parts of parameter space that's especially strong earlier in the learning process.

[-]avturchin2y20

I heard that there are no local minima in high-dimensional spaces because there will almost always be paths to the global minimum.

[-]Joel Burget2y*20

If, for instance, one minimum’s attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my Javascript code to calculate this is accurate enough, that is).

Could you share this code? I'd like to take a look.

[-]silentbob2y30

Maybe I accidentally overpromised here :D this code is just an expression, namely 1.0000000001 ** 175000000000, which, as wolframalpha agrees, yields 3.98e7.
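For anyone worried about floating-point accuracy in that expression, here is a small cross-check (a sketch, not the original code): it evaluates the power directly and again via logarithms, using the identity (1 + x)^n = exp(n · log1p(x)), which avoids forming 1 + x explicitly. The variable names are made up.

```javascript
const relativeIncrease = 1e-10; // 0.00000001 % expressed as a fraction
const dimensions = 175e9; // parameter count used in the post

// Direct evaluation, equivalent to the expression quoted above.
const direct = (1 + relativeIncrease) ** dimensions;

// Same quantity via logarithms: (1 + x)^n = exp(n * log1p(x)).
const viaLog = Math.exp(dimensions * Math.log1p(relativeIncrease));

console.log(direct.toExponential(3), viaLog.toExponential(3));
```

Both routes land at about 3.98e7, i.e. roughly e^17.5, so double precision is accurate enough for the "40 million" figure.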

[-]silentbob4mo345

After using Claude Code for a while, I can't help but conclude that today's frontier LLMs mostly meet the bar for what I'd consider AGI - with the exception of two things, that, I think, explain most of their shortcomings:

lack of real multimodality
context window limitations

Most frontier models are marketed as multimodal, but this is often limited to text + some way to encode images. And while LLM vision is OK for many practical purposes, it's far from perfect, and even if they had perfect sight, being limited to singular images is still a huge limitation^[1].

Imagine you, with your human general intelligence, were sitting in a dark room, and were conversing with someone who has a complex, difficult problem to solve, and you do your best to help them. But you can only communicate through a mostly text-based interface that allows this person to send you occasional screenshots or photos. Further imagine that every hour or so you lose your entire memory & mental model of the problem, and find yourself with nothing but a high-level and very lossy summary of what has been discussed before.

I think it's very likely that under such restrictive circumstances, it's just very hard to not run into all kinds of failure modes and limitations of capability, even for the undoubtedly general intelligence that is you.

So, in some sense, I'd think that there's an "intelligence overhang", where the raw intelligence that exists in these LLMs can't fully unfold due to modality & context window limitations. These limitations mean that Claude Code et al. don't yet show the effects on the economy and world as a whole that many would have expected from AGI. But I'd argue it makes sense to decouple the actual "intelligence" from the limiting way in which it's currently bound to interact with the world - even if, as some might correctly argue, modality & context window are just an inherent property of LLMs. Because this is an important detail about the state of things that, I suppose, is neither part of most of the definitions people gave for AGI in the past, nor of the vague intuitions they had about what the term means.

^[1] as opposed to, say, understanding video, including sound, and including a sense of time. (This is not to say that vision is necessary for general intelligence, of course; but that's kind of my whole point: the general intelligence is already there, it's just that the modality + context restrictions mean AI is still much less effective at influencing the world in the way that a "naively" imagined AGI would)

[-]Vladimir_Nesov4mo182

I think jaggedness of RL (in modern LLMs) is an obstruction that would need to be addressed explicitly, otherwise it won't fall to incremental improvements or scaffolding. There are two very different levels of capability, obtained in pretraining and in RLVR, but only pretraining is somewhat general. And even pretraining doesn't adapt to novel situations other than through in-context learning, which only expresses capabilities at the level of pretraining, significantly weaker than RLVR-trained narrow capabilities.

Scaling will make pretraining stronger, but probably not sufficiently to matter for this issue, and natural text data will only last for another step of improvement similar to what happened in 2023-2025 (in pretraining only, ignoring RLVR). If RL doesn't get more general, it'll probably remain useless for improving general capabilities outside the skills trained with RLVR. Capabilities will remain jagged, with gaps that have to be addressed manually by changing the training data.

This could change within a few years, possibly even faster if LLMs can be RLVRed to become able to RLVR themselves, though that won't necessarily work. Or via next token prediction RLVR that makes pretraining stronger without requiring more natural text data, but this probably needs much more compute even if it works in principle, so might also take 5-10 years, to uncertain capability level results.

[-]cubefox4mo2-6

So, in some sense, I'd think that there's an "intelligence overhang", where the raw intelligence that exists in these LLMs can't fully unfold due to modality & context window limitations.

Another missing piece is research taste or curiosity. Of the sort you would need to come up with ideas for new papers.

[-]Ashe Vazquez Nuñez4mo10

Does "multi-modality" include features like having a physical world model, such that it could input sensible commands to a robot body, for instance?

[-]silentbob1y*330

One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding "flips" from representing the original token to finally representing the prediction of the next token.

By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and after going through all the layers they eventually all encode the next token, and somewhere in the layers between this shift must happen. But what, upon some reflection, makes much more sense would be something very roughly like, say:

some 1000 dimensions encode the original token
some other 1000 dimensions encode the prediction of the next token
the remaining 10,288 dimensions encode information about all available context (which will start out "empty" and get filled with meaningful information through the layers).

In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there's the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it's still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.

Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just "translate" from token space to embedding space and vice versa^[1]. This made sense in relation to the initial & naive "embeddings represent tokens" interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an "extraction" of the information content in the embedding that encodes the prediction.

One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that "Zero layer transformers model bigram statistics". So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly I'm not sure if this is only the case when the transformer is trained with zero layers, or also in, say, GPT3, when during inference you just skip all the layers)
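To make the zero-layer case concrete, here is a minimal sketch with a four-token vocabulary and separate embed and unembed matrices. All weights are hand-picked for illustration and do not come from any trained model; the point is only that embedding followed directly by unembedding behaves like a bigram lookup ("which token tends to follow this one?") rather than returning the input token.

```javascript
const vocab = ["Good", "morning", "day", "the"];

// Hypothetical weights with d_model = 2 (hand-picked, not from a trained model).
// embed[i] is the embedding vector of token i.
const embed = [
  [1, 0], // Good
  [0, 1], // morning
  [0.5, 0.5], // day
  [-1, 0], // the
];
// unembed[k][j] is the contribution of embedding dimension k to the logit of token j.
const unembed = [
  [-1, 2, 1.5, 0],
  [0, -1, -1, 2],
];

function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Zero-layer "transformer": embed the token, then unembed immediately.
function zeroLayerPrediction(token) {
  const x = embed[vocab.indexOf(token)];
  const logits = vocab.map((_, j) => x.reduce((s, xk, k) => s + xk * unembed[k][j], 0));
  return softmax(logits);
}

const probs = zeroLayerPrediction("Good");
vocab.forEach((tok, j) => console.log(`P(${tok} | Good) = ${probs[j].toFixed(3)}`));
```

With these made-up weights, "morning" and "day" receive most of the probability mass after "Good", while "Good" itself gets very little.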

I would guess that transformer-experienced people (unless they disagree with my description - in that case, please elaborate what I'm still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.

^[1] Of course, this is not entirely true to begin with because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word "Good" and then unembed the embedding immediately, you would get a very high probability for "Good" back when in practice (I didn't verify this yet) you would probably obtain high probabilities for "morning", "day" etc.

[-]ryan_greenblatt1y142

Awkwardly, it depends on whether the model uses tied embeddings (unembed is embed transpose) or has separate embed and unembed matrices. Using tied embedding matrices like this means the model actually does have to do a sort of conversion.

Your discussion seems mostly accurate in the case of having separate embed and unembed, except that I don't think the initial state is like "1k encode current, 1k encode predictions, rest start empty". The model can just directly encode predictions for an initial state using the unembed.
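A small sketch of why the tied case differs, under the simplifying assumption (mine, for illustration) that all embedding vectors have unit length: with unembed = embed transposed, the logit of token j for input token i is the dot product embed[i] · embed[j], which is largest for j = i. So embedding and immediately unembedding hands back the input token, and the layers in between have to do the conversion towards a prediction. The vectors are hand-picked.

```javascript
const vocab = ["Good", "morning", "day", "the"];

// Hypothetical unit-length embedding vectors (hand-picked for illustration).
const embed = [
  [1, 0], // Good
  [0.6, 0.8], // morning
  [0, 1], // day
  [-1, 0], // the
];

// Tied embeddings: logits for input token i are dot products with every embedding row.
function tiedLogits(i) {
  return embed.map((row) => row.reduce((s, x, k) => s + x * embed[i][k], 0));
}

const topPredictions = vocab.map((_, i) => {
  const logits = tiedLogits(i);
  return vocab[logits.indexOf(Math.max(...logits))];
});

console.log(topPredictions); // each token's top "prediction" with zero layers
```

Without the unit-length assumption this is not guaranteed, since a long embedding vector pointing in a similar direction can outscore the token itself.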

[-]simulus1y112

There has actually been some work visualizing this process, with a method called the "logit lens".

The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

A more thorough analysis: https://arxiv.org/abs/2303.08112
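The core idea of the logit lens can be sketched in a few lines: take the residual stream after each layer and push it through the unembedding, as if the model stopped there. The numbers below are entirely made up, and real implementations typically also apply the final layer norm before unembedding, which is omitted here.

```javascript
const vocab = ["Good", "morning", "day", "the"];

// Hypothetical unembedding matrix: unembed[k][j] is the weight from
// residual dimension k to the logit of token j.
const unembed = [
  [-1, 2, 1.5, 0],
  [0, -1, -1, 2],
  [0, -1, 1, 0],
];

// Hypothetical residual stream at one position: the initial embedding,
// followed by the vectors that two layers add to it.
const embedding = [1, 0, 0];
const layerUpdates = [
  [0, 0, 0.2],
  [0, 0, 0.4],
];

function logits(state) {
  return vocab.map((_, j) => state.reduce((sum, x, k) => sum + x * unembed[k][j], 0));
}

function topToken(state) {
  const l = logits(state);
  return vocab[l.indexOf(Math.max(...l))];
}

// Logit lens: decode the residual stream after every layer, not just the last one.
const lensReadout = [];
let state = embedding.slice();
lensReadout.push(topToken(state));
for (const update of layerUpdates) {
  state = state.map((x, k) => x + update[k]);
  lensReadout.push(topToken(state));
}

console.log(lensReadout);
```

In this toy, the decoded top token starts as "morning" and only flips to "day" after the second layer, which is the kind of gradual refinement the linked posts visualize on real models.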

[-]Logan Riggs1y42

You can learn a per-token bias over all the layers to understand where in the model it stops representing the original embedding (or a linear transformation of it) like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases

You could also plot the cos-sims of the resulting biases to see how much it rotates.

[-]Gurkenglas1y40

I didn't verify this yet

Do it! I bet slightly against your prediction.

[-]silentbob10mo263

One super useful feature of Claude that some may not know about:

Claude is pretty good at creating web apps via artifacts
You can run and use these web apps directly in the Claude UI
You can publish and share these artifacts directly with others

As far as I can tell, the above is even available for non-paying users.

Relatedly: browser bookmarklets can be pretty useful little tools to reduce friction for recurring tasks you do in your browser. It may take <5 minutes to let Claude generate such bookmarklets for you.

You can also combine these two things, such as here: https://claude.ai/public/artifacts/9c58fb4a-5fae-48ce-aed3-60355bfd033e

This is a web app built and hosted by Claude which creates a customized browser bookmarklet that provides a simple text-to-speech feature. It works like this:

customize the configuration on the linked page
drag the "Speak Selection" button into your bookmarks bar
from then on, on any website, when you mark text and then click the bookmark (or, after having clicked on it once, you can also use the defined hotkey instead), the selected text will be read out to you

Surely there are browser plugins that provide better TTS than this, but consider it a little proof of concept. Also this way it's free, friction-less, requires no account etc. Claude also claimed that, when using Edge or Safari, higher quality system voices may be available, but I didn't look into this.
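For a rough idea of what such a bookmarklet can look like under the hood, here is a minimal sketch. It is not the code behind the linked artifact; `buildSpeakBookmarklet` and its options are hypothetical, and the hotkey feature is left out. It relies on the browser's Web Speech API (`speechSynthesis`, `SpeechSynthesisUtterance`) and `window.getSelection()`.

```javascript
// Build a "javascript:" URL that reads the current text selection aloud.
// `rate` and `lang` are illustrative configuration options.
function buildSpeakBookmarklet({ rate = 1, lang = "en-US" } = {}) {
  const body = `(function () {
    var text = String(window.getSelection());
    if (!text) return;
    var utterance = new SpeechSynthesisUtterance(text);
    utterance.rate = ${JSON.stringify(rate)};
    utterance.lang = ${JSON.stringify(lang)};
    window.speechSynthesis.cancel();
    window.speechSynthesis.speak(utterance);
  })();`;
  return "javascript:" + encodeURIComponent(body);
}

const speakUrl = buildSpeakBookmarklet({ rate: 1.2 });
console.log(speakUrl);
```

The printed URL can be saved as the address of a bookmark; clicking it on a page with selected text should then trigger the speech output, depending on which voices the browser offers.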

Some other random things that can be done via bookmarklets:

a button cycling through different playback speeds of all videos on the current website, in case you sometimes interact with video players without such a setting in their UI
if you're fine with having some API key in your bookmarklet, you can automate all kinds of, say, LLM calls
- If you're using Chrome and have enabled the local Gemini nano AI, you can even use that in your bookmarklets without any API key being involved (haven't tried this yet)
start & show a 5 minute timer in the corner of the page you're on
show/hide parts of the page, e.g. comments on a blog, Youtube recommendations
highlight-for-screenshot overlay: enable temporarily drawing on the page to highlight things to then take screenshots; maybe slightly lower friction than having to use a separate paint app for that. Usable here (relevant keys after activating: Enter to leave drawing mode, ESC to close overlay, 1-9 to change marker size).
inline imperial<->metric unit converter
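As one example from this list, the playback speed button could be sketched as follows. The helper names and the default list of speeds are assumptions; the generated code only uses `document.querySelectorAll` and the standard `playbackRate` property of video elements.

```javascript
// Pure helper: return the speed that follows `current` in `speeds`, wrapping around.
// Unknown speeds fall back to the first entry.
function nextSpeed(current, speeds) {
  const index = speeds.indexOf(current);
  return speeds[(index + 1) % speeds.length];
}

// Build a "javascript:" URL that cycles the playback speed of all videos on the page.
function buildSpeedBookmarklet(speeds = [1, 1.5, 2]) {
  const body = `(function () {
    ${nextSpeed.toString()}
    var speeds = ${JSON.stringify(speeds)};
    document.querySelectorAll("video").forEach(function (video) {
      video.playbackRate = nextSpeed(video.playbackRate, speeds);
    });
  })();`;
  return "javascript:" + encodeURIComponent(body);
}

const speedUrl = buildSpeedBookmarklet();
console.log(speedUrl);
```

Each click moves every video on the page to the next speed in the list and wraps back to the first one after the last.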

For some of these, a browser plugin or tampermonkey script or so may be preferable - but beware fake alternatives. If you just think "I could do X instead" but never actually do it, then maybe just creating a bookmarklet may be the better option after all, even if it's not the most elegant solution.
Happy to hear about your use cases!

[-]title2210mo70

Cool! I didn't know about bookmarklets. I knew Gemini would host little pages and apps made in canvas, so I played around a bit to see how different AIs handle it.
Gemini is like your Claude example. Here is a 5 min timer bookmarklet
https://g.co/gemini/share/73048c89f2f2
Perplexity lab made a bookmarklet and a nice html explainer, but sharing is a little less intuitive. There's a tab for "app" and at the bottom of that page a button to share the url. Here is an RNG (the code works but the "drag the button" part doesn't, and I was just looking for a proof of concept):
Random Number Generator Bookmarklet - Free Tool
ChatGPT has canvas like Gemini. It should work the same, but in my 15 min of testing the shared page hangs and the bookmarklet doesn't seem to work. But I suppose it could be that my work PC is breaking it somehow. Anyway, here is an attempted "read mode" for webpages:
ChatGPT - Read Mode Static
Grok's canvas is Grok Studio. It seems like it can only be summoned in chat, like Claude. It doesn't seem like you can share the app. Grok suggested:
To share publicly, host it on a free platform like GitHub Pages, Glitch, or Replit (upload the file and get a public URL).
I can share the chat that generated the bookmarklet though. Also, it doesn't seem to work but again, proof of concept:
Mute all tabs
https://grok.com/share/c2hhcmQtNA%3D%3D_e0f91d33-7aba-4c8a-942b-db570b049536

Just to see if these bookmarklets were even possible I re-tried in Gemini
-Read mode works: https://g.co/gemini/share/dc55070e0dc4
-RNG, app works, "drag to bookmarks" doesn't: https://g.co/gemini/share/024d865cbbae
-Mute all tabs works: https://g.co/gemini/share/5dba86dee603

[-]Jakub Halmeš10mo52

A couple of weeks ago, I was surprised to find out that you can create artifacts that call the Claude API. Silly example: Chat app with Claude always responding with capitalized text.

[-]Joseph Miller10mo43

Wow that feels almost cruel! Seems to change the Claude personality substantially?

[-]faul_sname10mo20

Claude can also invoke instances of itself using the analysis tool (tell it to look for self.claude).

[-]silentbob4mo246

Often, qualitative differences turn out to be quantitative, especially in AI progress. As The Bitter Lesson pointed out in 2019, jumps in capabilities often don't need some breakthrough or human ingenuity, but merely (much) more of the same, that is, scaling up the compute. And so we went from GPT2, which could produce English text with mostly flawless grammar but not much more, to the multilingual GPT3.5 that could write entire essays, to later models that are coming for most white collar jobs.

This naturally raises the question which other limitations exist in AI that seem qualitative, but end up being pretty much solved by the same thing but bigger. I wonder about three areas in particular:

Continual learning
Reliability & hallucinations
Multi-modality much closer to the human experience (something like audio-visual with depth- and time perception)

For all of these, it's tempting to claim that they require some big breakthrough or entirely different approach than LLMs, and that the default would be that these current limitations will pose natural upper bounds to the impact of LLMs on our world. And I can well imagine that certain breakthroughs could greatly accelerate progress in these areas. But I also can't help but suspect that even without major breakthroughs, we'll inevitably see serious progress on these fronts anyway.

Continual learning: context window sizes didn't see the rapid progress of some other areas & benchmarks, but even then, today's frontier models have ~10x the context window of 2023. It's not the primary thing labs are optimizing, but it seems overwhelmingly likely to me that algorithmic + hardware progress will lead to larger context windows over the years. And if we do reach 10M or 100M token context windows eventually, I wouldn't be surprised if that (combined with other capability improvements) will be sufficient to make in-context learning capable enough to mostly alleviate the need for true continual learning for most economically valuable purposes. Sure, if somebody figures out true scalable & robust continual learning, then that's an even bigger deal^[1]. But I'd argue that even if this for whatever reason does not come to pass, merely scaling up context window sizes could eventually be sufficient to surpass the "context persistence advantages" of humans.^[2]

Reliability & hallucinations: some people assume that LLMs will always hallucinate and it will take a fundamentally different approach to overcome this. Maybe they're right, but at least in agentic coding we see that if you get the feedback loops right and "tether the model" to some verifiable part of reality, hallucinations mostly become a non-issue. It's unclear to me how far this will actually work & scale in other areas, and Sam Altman's prediction from 2023 that two years from then "we won't still talk about" hallucinations certainly turned out to be incorrect. But I wouldn't be surprised if some relatively marginal changes, such as forms of embodiment^[3] or best-of-n style answers, or whatever other surprisingly simple strategy will be identified in the meantime, end up increasing reliability greatly.
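To see why something as simple as best-of-n could matter, consider an idealized back-of-the-envelope model. Assume (unrealistically) that each sampled answer is independently correct with probability p, and that some verifier reliably picks a correct answer whenever at least one exists. Then the success probability is 1 − (1 − p)^n. The function name is made up, and real samples are correlated and real verifiers imperfect, so treat this as an upper bound on the effect.

```javascript
// Probability that at least one of n independent samples is correct,
// given a per-sample success probability p (idealized model, see assumptions above).
function bestOfNSuccess(p, n) {
  return 1 - (1 - p) ** n;
}

for (const n of [1, 2, 5, 10]) {
  console.log(`p = 0.7, n = ${n}: ${(100 * bestOfNSuccess(0.7, n)).toFixed(2)} %`);
}
```

Under these assumptions, a 70% per-sample reliability already exceeds 99% with five samples, which is why the quality of the verification step ("tethering" to something checkable) ends up being the bottleneck.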

Multi-Modality: in principle, a larger context window might allow just providing an LLM with 100s of images representing some form of livestream from a camera (or two), and appropriate training or reasoning might allow it to "perceive" movement. On the one hand, I'd think that it's a huge disadvantage for the LLM if the "time modality" is not properly represented in the way its inputs are tokenized^[4]. But on the other hand, it still seems conceivable that even such a suboptimal encoding of movement as "100 separate tokenized still images" could be handled by more advanced LLMs well enough to basically solve current limitations of LLM perception^[5].

I'm not claiming that any of this is what is going to happen. Multi-modality in particular seems like something labs could expand a lot if it was a priority, but they just happen to focus on other areas that are more lucrative on the current margin. Either way, the point of this post is just to point out that I do think that these developments may be a bit of a lower bound of AI progress. Even if no major breakthroughs occur, I'd still assume we eventually end up

with in-context learning capable enough to surpass humans in many areas where we would currently assume continual learning to be required
with fewer and fewer hallucinations in many areas
and with AI models that can perceive the world in very similar ways to us, in so far as that's helpful for the area they're deployed in (and in many ways that may go way beyond the limits of human perception)

^[1] And to be fair, my best guess is that continual learning will see some breakthroughs in the next 1-3 years and will essentially get solved.
^[2] Somewhat relatedly, I also get the impression that much of what's currently happening in the AI coding landscape (around skills, MCPs, agents/claude.md files, memory, context management...) is to some degree "overfitting" on the current margin of AI capability and will become obsolete in future generations once LLMs get better at building & persisting meaningful context themselves dynamically. We're in this fun phase, where humans can still teach LLMs a lot to make them more useful, but I highly doubt this phase will last very long.
^[3] My thought here being that some form of embodiment "nails" the AI to reality and (directionally) prevents it from spiraling out of control in strange failure modes; of course, it might still turn psychotic for various reasons, but having a constant stream of "reality" would almost certainly have some grounding influence compared to its current reality that largely consists of its own thoughts, system prompts and the ramblings of its conversation partner.
^[4] E.g., CNNs seem conceptually nice in that they encode a certain prior about the modality of images, in that neighboring pixels tend to be more relevant for each other than more distant pixels. Similarly and reversed, providing frames of a video as entirely separate images just seems lacking, as the temporal connection isn't really encoded, but just kind of "interpreted into it" after the fact.
^[5] To name one example of the limitations I mean here: if you're working on a website and add some subtle animations to improve UX, this is something today's coding agents have a very hard time testing. They can generally use browsers, click around, look at different screenshots, but this usually happens "one screenshot at a time" and does not include animations. They can still implement animations, and often do a good job at that, but they're typically doing this blindly. Any human, on the other hand, who would use this website, would instantly and automatically perceive animations, and notice when they're off in any considerable way.

[-]Vladimir_Nesov4mo40

None of this helps with automatically acquiring deep skills like playing good chess or fluency in a novel topic of math, and so these aren't the straight lines on graphs directly relevant to crossing the AGI threshold, full automation of civilization.

Humans don't know how to automate learning of arbitrary deep skills in an AI that only come up post-deployment, but can manually add them with RLVR at training time, by developing RL environments, graders, and tasks. AI might automate this process not by doing what humans couldn't and inventing algorithmic advancements for low level acquisition of deep skills, but instead by merely being smart and skilled enough to do all the same things that humans are currently doing to make it work, "manually". So in principle AI might become able to automatically acquire deep skills if it's capable enough at routine AI R&D, even if it doesn't have the capability to acquire deep skills at a low level, the way humans do, and doesn't have the capability to invent substantial algorithmic innovations that humans haven't invented yet. Some of the straight lines on graphs are relevant to when this might happen, and so indirectly they are relevant to crossing the AGI threshold.

I don't think in-context learning or even true continual learning with anything like the current methods can automate acquisition of deep skills at a low level, because only RLVR currently works for that purpose, context persistence is essentially unrelated. But these things might get AIs to the level of capability where they can do the same things as the humans who set up the ingredients for task-specific RLVR.

[-]Brendan Long4mo42

Even without longer contexts, LLMs being able to use notes effectively seems like the kind of skill issue that will likely improve over time with or without algorithmic breakthroughs. 1M token context is already way more than a human can keep track of without notes.

[-]silentbob3mo51

True, it's possible larger context windows aren't even needed and 1M is sufficient for the majority of our economy to get automated.

I also think it's easy to underestimate how much context humans actually gather over the years though. E.g. in my job there's a huge amount of information I picked up over time. And I never fully know in advance what subset of that information I might need on any given day. It would be futile to even try to write down everything that I know, because much of that knowledge is latent/fuzzy/hard to put in words/seems irrelevant but isn't necessarily.

To list a few such things:

Company culture and structure
Teams and responsibilities
Many dozens of co-workers, their tenure, skills, personalities, common memories, what they look like, their voices
Dozens of tools, how to use and navigate them, when and why to use them, when and why they were introduced
A huge code base, or at least many many bits and pieces of it
The product(s) including future roadmap and past development, some known issues and limitations, their design
Context about how users interact with our software
Our competition and how we relate to them

I'd assume that my visual knowledge alone (what products, tools, people, logos etc look like) could fill a significant part of a 1M context window (given the current state of the tech).

I recently tried to compile one really thorough readme for LLMs about one project I had worked on. I think it ended up at around 50k tokens, but it was very far from complete, as I have so much latent knowledge about it that I can't just easily export on demand - it just lives somewhere in my brain, stashed away until some situation arises where I actually need it. That said, it's possible that "the essence" of that knowledge could be compressed to, say, 10-20% of that token count, which would indeed make your argument very plausible.

[-]quetzal_rainbow3mo20

I think this is a misunderstanding of the bitter lesson. The bitter lesson says that instead of a handcrafted ontology you need a method that leverages large amounts of data and compute to discover the ontology. The Transformer architecture is the human ingenuity here.

[-]silentbob5mo155

Egoism has a bad reputation, but I think that doesn't do it justice. Some degree of egoism is likely very helpful, as it's a form of ensuring available local knowledge is taken into account. If people were not at least mildly egoistic, a great deal of local knowledge would be ignored, leading to everyone supposedly helping others in not-actually-helpful ways.

What I think is much more harmful overall is the distinction between valuing public goods^[1] at least somewhat vs not valuing them at all in their decision-making. This is something I've in particular seen in some B2C companies I've been involved with: when things are going well for them, they're proud of the value they produce for the public (as in, typically, their paying + non-paying users). But when the market gets tough and their growth/existence is at risk, they often very quickly stop caring about public goods entirely and start making very one-sided trade-offs that are (supposedly) beneficial for them while often being incredibly annoying to many users, when other, similarly-beneficial-to-them solutions might exist that don't come at the expense of users. Two examples:

A software company I once worked at noticed that there was some account sharing between paying users. Their solution was to aggressively limit active logins to a single device / browser, so whenever you log in somewhere, you are logged out everywhere else. My best guess was that this may have increased sales by perhaps 0.3% at best, while being annoying to a large fraction of users. (Ultimately, I think it was even net negative for the company itself, as annoyed users are more likely to churn, but that's a topic for another time)
Staying a bit vague here, but in another instance, a feature that was frequently used by hundreds of thousands of non-paying users, and which caused negligible maintenance effort, was entirely removed because it didn't measurably drive conversions

So, the problem in such cases is not so much that the company cares about their own growth; the problem is when they completely disregard the (potential for positive) externalities^[1], compared to only mostly disregarding them but at least having them represented in their model with a non-zero weight.

There are surely different reasons for why such absolute disregard for public goods can occur. Some speculation:

zero-sum thinking where one implicitly assumes that the only variable in question is "benefits me" vs "benefits the public" and not seeing that these can move independently
valuing something a bit vs not at all can increase the complexity of your decision and add to mental exhaustion^[2]
diffusion of responsibility, where individuals inside the company may care about providing value to the public, yet nobody feels entirely responsible to defend this perspective in big decisions
and, of course, Goodhart's law and optimizing what you can measure, which often does not include the real value you provide, but instead mostly easy-to-hack superficial metrics and costs/benefits to yourself

I don't pretend to have any solution for this. But my impression is that some of the decision-making, at least in the companies I've seen, tends to be highly path-dependent, and a good argument or suggestion made to the right people at the right point in time can make a huge difference. So I guess, even if this approach doesn't scale all that well, having well-meaning individuals within companies occasionally speak up and make productive proposals could move some needles.

^[1]
I can imagine that I'm not using "public goods" and "externalities" in precisely the ways they're usually used. I hope the post makes some sense anyway. If you know of any simple ways to phrase things more precisely, please let me know.
^[2]
I suspect this is why even many people who care about animals and dislike factory farming prefer to not think about the topic at all rather than making decisions case by case and trading off their comfort vs how much harm is caused. E.g., when you eat at a restaurant with a lot of veggie options, it would (for most people) be very easy to eat something without meat. Whereas when friends invite you over and cook something with meat, it would be much more costly/unpleasant to refuse eating it. Still, I know only a few people who are "vegetarian when it's easy", yet I know many people who dislike factory farming, but give it practically 0 weight in their decisions nonetheless.

[-]gjm5mo70

Is it actually true that egoism, in the sense of "some degree of egoism" or "at least mildly egoistic", has a bad reputation?

My impression is that (1) almost everyone cares at least a bit about random other people but cares a lot more about themself, and (2) almost everyone is aware of #1 and doesn't see it as particularly bad. If you call someone altruistic you aren't generally claiming that they don't give higher priority to their own interests than others' at all, only that they care more about others relative to themselves than is usual.

I agree that it's more broadly accepted when companies care scarcely at all about the interests of others -- we largely accept corporate behaviour that would be universally regarded as sociopathic if done by an individual -- and that this is a bad thing.

[-]silentbob5mo20

Yeah, fair enough. My impression has been that some people feel guilty about caring about themselves more than about others, or that it's seen as not very virtuous. But maybe such views are less common (or less pronounced) than the vibes I've often picked up imply. :)

[-]faul_sname5mo42

This suggests a course of action if you work at a company which can have significant positive externalities and cares, during good times, more than zero about them: during those good times, create dashboards and alerts with metrics which correlate with those externalities, to add trivial friction (in the form of "number go down feels bad") to burning the commons during bad times.
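
A minimal sketch of what such an alert could look like, assuming a hypothetical weekly metric such as the number of non-paying users served by a free feature. The metric, the 10% threshold and the function names are illustrative assumptions, not a reference to any real monitoring tool.

```python
from typing import Optional, Sequence


def relative_drop(history: Sequence[float], window: int = 4) -> float:
    """Relative drop of the latest value versus the mean of the preceding
    `window` values (a positive result means the metric went down)."""
    if len(history) < window + 1:
        raise ValueError("not enough data points")
    baseline = sum(history[-window - 1:-1]) / window
    if baseline == 0:
        return 0.0
    return (baseline - history[-1]) / baseline


def commons_alert(name: str, history: Sequence[float],
                  threshold: float = 0.10) -> Optional[str]:
    """Return an alert message if the metric fell by more than `threshold`."""
    drop = relative_drop(history)
    if drop > threshold:
        return f"{name} is down {drop:.0%} versus the recent average"
    return None


# Hypothetical weekly counts of non-paying users served by a free feature.
weekly_free_users = [120_000, 118_000, 121_000, 119_000, 60_000]
print(commons_alert("free-feature weekly users", weekly_free_users))
```

The point of setting this up during good times is that the baseline and the threshold already exist when a cost-cutting decision is on the table, so the drop shows up without anyone having to argue for measuring it.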

[-]Vladimir_Nesov5mo30

The egoism/altruism distinction makes sense in the framing of boundaries and scopes of optimization for different preferences. If you optimize things that are centrally yours (within the boundary of what is you or yours) according to values of egoism, and optimize things that are centrally not yours (outside this boundary) but within some circle of concern according to values of altruism, this is a healthy state of affairs. (Enemies could form yet another scope of optimization, with values of coercion and such.)

This works for interventions that either mostly affect things on one side of the boundary or the other, but breaks down when they significantly affect both, when there are tradeoffs. An important failure mode (perhaps downstream of having to deal with situations that have tradeoffs) is when values appropriate for one scope of optimization get directly applied to another, when egoism or coercion target your circle of concern, or altruism gets turned back on yourself (or enemies). Or when the boundaries get redrawn in a strange way (someone with a lot of power could consider quite a lot of other people "theirs", applying egoism, optimizing for effectiveness without respect for autonomy).

[-]silentbob9mo111

Some quick thoughts on vibe coding:

it turns you from a developer into more of a product manager role
- but the developers you manage are a) occasionally stupid/unwise and b) extremely fast and never tired
this makes it relatively addictive, because feedback cycles are much shorter than for a "real" product manager, who often has to wait for weeks to see their wishes turn into software, and you have a strong element of randomness in your rewards, with things sometimes turning out surprisingly well one-shot, but sometimes not at all
It can also lead to laziness, as it's very tempting to get used to "just letting the AI do it" even in not primarily vibe-coded projects, instead of investing one's own brainpower
AI agents tend to never/rarely talk back or tell you that something is a bad idea or doesn't work well with the current architecture; they just do things as best as currently possible. This form of local optimization quickly runs into walls if not carefully mitigated by you.
- Part of the problem is that by default the AI has extremely little context and knows little about the purpose, scope and ambition of your project. So when you tell it "do X", it typically can't tell whether you mean "do X quick and dirty, I just want the results asap" or "lay out a 10 step plan to do X in the most sustainable way possible that allows us to eventually reach points Y and Z in the future". If it gets things wrong in either direction, that tends to be frustrating, but it can't read your mind (yet).
AI agents that are able to run unit tests and end-2-end tests and see compiler errors are so much more useful than their blind counterparts
If you need some particular piece of software but are unsure if current AIs will be able to deliver, it might make sense to write a detailed, self-contained and as-complete-as-possible specification of it, to then throw it at an AI agent whenever a new model (or scaffolding) comes out. Github Copilot with GPT5 was able to do many more things than I would have imagined, with non-trivial but still relatively limited oversight.
- I haven't tried yet if just letting it do its thing, saying only "continue" after each iteration, may be sufficient. Maybe I put more time into guiding it than would actually be necessary.
- That being said: writing a self-contained specification that contains your entire idea of something with all the details nailed down such that there is little room for misunderstandings is surprisingly hard. There are probably cases where just writing the software yourself (if you can) takes less time than fully specifying it.
- That being said, "writing down a specification" can also happen interview-style using an AI's voice mode, so you can do it while doing chores.

[-]silentbob1y90

For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly. (which mostly applies to people speaking in a non-native language, e.g. people from continental Europe speaking English)

Some examples that I’ve heard from different people around me over the years:

Saying “rectangel” instead of “rectangle”
Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
Saying something like, uhh, “devil-oupaw” instead of “developer”
Saying “leech” instead of “league”
Saying “immu-table” instead of “immutable”
Saying "cyurrently" instead of "currently"

I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it's pronounced. This happened to me quite a lot^[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still I've seen all these other people stick to their very unusual pronunciations anyway. What's up with that?^[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.

Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing "dude" incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.

So, as I learned now, "dude" is pronounced "dood" or "dewd". Whereas I used to say "dyood" (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.

Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said "dood", and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said "dood" (which, in my defense, didn't happen all that often in my presence^[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.

I never quite realized that practically everyone said "dood" and I was the only "dyood" person.

So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words.

But, admittedly, I still don't wanna be the one to point it out to them.

And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.

^[1]
e.g., for some time I thought "biased" was pronounced "bee-ased". Or that "sesame" was pronounced "see-same". Whoops. And to this day I have a hard time remembering how "suite" is pronounced.
^[2]
Of course one part of the explanation is survivorship bias. I'm much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious.
^[3]
Maybe they were intimidated by my confident "dyood"s I threw left and right.

[-]Viliam1y42

I use written English much more than spoken English, so I am probably wrong about the pronunciation of many words. I wonder if it would help to have software that would read each sentence I wrote immediately after I finished it (because that's when I still remember how I imagined it to sound).

EDIT: I put the previous paragraph in Google Translate, and luckily it was just as I imagined. But that probably only means that I am already familiar with frequent words, and may make lots of mistakes with rare ones.

[-]silentbob10mo80

Using coding agents gave me a new appreciation for the Jevons paradox, a concept that received a lot of attention earlier this year when DeepSeek R1's release in January coincided with a sudden drop in Nvidia's stock price, possibly as the supposed efficiency gains of the model made many traders assume this would lead to a decrease in hardware demand. The stock eventually bounced back though, with Jevons paradox being cited as one of the reasons, as it predicted that efficiency gains would lead to an increase in hardware demand rather than a decrease.

I recently realized that Github Copilot's agent mode with GPT5 is way more capable than I would have imagined, and I started using it a lot, starting a bunch of small to medium-sized projects. I'd just start with an empty directory, write a projectOutline.md file to describe what I ultimately want to achieve, and let the agent take it from there (occasionally making some suggestions for refactorings and writing more unit + end2end tests, to keep things stable and scalable). This way it would just take me something like 5-50 prompts and a few hours of work to reach an MVP or prototype state in these projects that otherwise would have taken weeks.

The naive reaction to this would be to assume I would be much faster with my coding projects and hence would have to spend less time on coding. But, as Jevons paradox would predict, the opposite was the case - it just caused me to work on way more projects, many that I otherwise would never have started, and I spent much more time on this than I would have otherwise (over a given time frame). So even though coding became much faster (I may be wrong, but I'm pretty confident this is true in net dev time despite some contrary evidence, and I'm extremely certain it's true in calendar time, as my output increased ~30x basically overnight - not because my coding speed was that slow beforehand, but because I never prioritized it as it wasn't worth doing over other activities), the total time I spent programming increased a lot.
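
The effect can be illustrated with a toy constant-elasticity demand model. This is a textbook simplification, not something measured here: assume the number of projects started scales as (hours per project)^(-elasticity). Total time is then projects times hours per project, which rises when projects get cheaper whenever the elasticity is above 1. The elasticity value, the baseline and the 10x speedup below are made-up inputs.

```python
def total_hours(hours_per_project: float, elasticity: float,
                baseline_projects: float = 1.0,
                baseline_hours: float = 100.0) -> float:
    """Total hours spent if the number of projects responds to the
    per-project cost with constant elasticity (assumed toy model)."""
    projects = baseline_projects * (hours_per_project / baseline_hours) ** (-elasticity)
    return projects * hours_per_project


before = total_hours(100.0, elasticity=1.5)  # baseline: one project at 100 h
after = total_hours(10.0, elasticity=1.5)    # each project is now 10x cheaper

print(f"before: {before:.0f} h, after: {after:.0f} h")
```

With an elasticity below 1 the same speedup would reduce total hours instead, so the paradox is a statement about how strongly demand reacts, not a law that holds in every case.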

This will probably get old quickly (with the current frontier models), as with most projects I might hit a "wall" where the agents don't do a great job of further iterative improvements, I suppose. But either way, it was interesting to experience this first-hand, how "getting faster at something" caused me to spend much more, rather than less, time on it, as obvious as this effect may be in hindsight.

[-]silentbob6mo30

Reality itself doesn't know whether AI is a bubble. Or, to be more precise: whether a "burst-like event"^[1] will happen or not is - in all likelihood, as far as I'm concerned - not entirely determined at this point in time. If we were to "re-run reality" a million times starting today, we'd probably find something that looks like a bursting bubble in some percentage of these and nothing that looks like a bursting bubble in some other percentage - and the rest would be cases where people disagree even in hindsight whether a bubble did burst or not.^[2]

When people discuss whether AI is a bubble, they often frame this (whether deliberately or not) as a question about the current state of reality. As if you could just go out into the world and do some measurements, and if you find out "yep, it's a bubble", then you know for sure that this bubble must pop eventually.^[3] And while there certainly are ways to measure properties of bubbliness of different parts of the economy, it could well be that what looks like a bubble today may either slowly "deflate" rather than burst, or reality around it catches up eventually, justifying the previously high valuations.

Uncertainty is sometimes conceptually split into two parts: epistemic (our limited knowledge) and aleatory (fundamental uncertainty in reality itself). My claim here is basically just that, when it comes to bubbles bursting in the future, the aleatory component is not 0, and we shouldn't treat it as such. In other words, there is an upper limit in how certain a rational person can become at any point in time on whether an AI bubble burst event will occur or not. Sadly, knowing where that limit is is in itself uncertain, which makes all of this not very actionable. Still, it seems important^[4] to acknowledge that we can't just expect that doing any amount of research today will lead to certainty on such questions, as reality itself probably isn't fully certain on the question at hand.
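
As a toy illustration of this split (the value 0.4 is an arbitrary placeholder, not an estimate, and the outcome is simplified to burst vs no burst): suppose the burst event behaves like a biased coin with probability `p_burst`. Research can shrink epistemic uncertainty by pinning down `p_burst`, but even someone who knew it exactly could not be more confident about the outcome than max(p, 1 - p).

```python
import random


def rerun_reality(p_burst: float, runs: int, rng: random.Random) -> float:
    """Fraction of simulated 're-runs' in which a burst-like event occurs."""
    bursts = sum(rng.random() < p_burst for _ in range(runs))
    return bursts / runs


def confidence_ceiling(p_burst: float) -> float:
    """Best achievable confidence in the more likely outcome once p_burst
    is known exactly, i.e. with no epistemic uncertainty left."""
    return max(p_burst, 1.0 - p_burst)


rng = random.Random(42)
p = 0.4  # arbitrary placeholder, not a forecast
print(rerun_reality(p, 1_000_000, rng))  # approaches p as the number of runs grows
print(confidence_ceiling(p))
```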

Ultimately, whether any burst-like event will eventually occur depends on a complex interplay of market participants' expectations. Any current bubble-like properties of the AI sector definitely play a big role in shaping these expectations and thereby the outcome - but even then, these expectations are highly path-dependent, and I find it very unlikely that the current state of the world fully determines how they will, in fact, develop.

^[1]
Of course, you can distinguish between "X has bubble-like properties right now" and "The X bubble will eventually burst". You could believe something "is a bubble" in some sense without having to also believe that this bubble will burst. In public discourse though, "X is a bubble" is often, whether explicitly or implicitly, equated with "the X bubble will burst". My take here mostly focuses on predictions of future bursts rather than claims about present bubble-like properties.
^[2]
I make no claims about the magnitude of these different probabilities; this is rather a meta argument about how these discussions are often framed, and that that can be misleading. It could of course still be true that reality is determined to a degree that the probability of a future bubble burst event gets ~arbitrarily close to 0% or 100% (even though I'd be surprised if that were currently the case)
^[3]
Not everyone discusses it like that or has this model of the world, but it's very easy to walk away with this impression when following the public discourse around the topic.
^[4]
Is it actually important? I'm not sure. Perhaps, even epistemic uncertainty is "enough" if you take it seriously. Maybe the idea of aleatory uncertainty in this context is just a useful intuition pump to resist the urge to become highly confident in one's judgment about the outcome of a complex process. 🤷

[-]Dagon6mo*50

It may be unknown, or even unknowable by any real-world agent. It's still not necessarily undetermined by the universe - I find it pretty likely that the universe is, in fact, deterministic.

Your underlying point is correct, though. Because human behavior is anti-inductive (people change their behavior based on their predictions of others' predictions), a lot of these kinds of questions are chaotic (in the fractal / James Gleick sense).

[-]gjm5mo20

So far as I can tell, the most plausible way for the universe to be deterministic is something along the lines of "many worlds" where Reality is a vast superposition of what-look-to-us-like-realities, and if the future of AI is determined what that means is more like "15% of the future has AI destroying all human value, 10% has AI ushering in a utopia for humans, 20% has it producing a mundane dystopia where all the power and wealth is in a few not-very-benevolent hands, 20% has it improving the world in mundane ways, and 35% has it fizzling out and never making much more change than it already has done" than like "it's already determined that AI will/won't kill us all".

(For the avoidance of doubt, those percentages are not serious attempts at estimating the probabilities. Maybe some of them are more like 0.01% or 99.99%.)

[-]silentbob1y30

After first learning about transformers, I couldn't help but wonder why on Earth this works. How can this totally made-up, complicated structure somehow end up learning how to write meaningful text and having a mostly sound model of our world?

(tl;dr: no novel insights here, just me writing down some thoughts I've had after/while learning more about neural nets and transformers.)

When I once asked someone more experienced, they essentially told me "nobody really knows, but the closest thing we have to an answer is 'the blessing of dimensionality' - with so many dimensions in your loss landscape, you basically don't run into local minima but the thing keeps improving if you just throw enough data and compute at it".

I think this makes sense, and my view on how/why/when deep neural networks work is currently something along the lines of:

there's some (unknown) minimal network size (or maybe rather "minimal network frontier", as with different architectures you end up with different minimal sizes) for every problem you want to solve (for a certain understanding of the problem and when you consider it solved), so your network needs to be big enough to even be able to solve the problem
the network size & architecture also determines how much training data you need to get anywhere
basically, you try to find network architectures such that you encode sensible priors about the modality you're working with that are basically always true while also eliminating a priori-useless weights from your network; this way, the training efforts allow the network to quickly learn important things rather than first having to figure out the priors themselves
- for text, you might realize that different parts of the text refer to each other, so need a way to effectively pass information around, and hence you end up with something like the attention mechanism
- for image detection, you realize that the prior of any given pixel being relevant for any other given pixel is higher the closer they are, so you end up with something like CNNs, where you start looking at low level features, and throughout the layers of the network, allow it to "convert" the raw pixel data successively to semantic data
in theory, you probably could just use a huge feed forward network (as long as it's not so huge as to overfit instead of generalizing to anything useful) and it would possibly end up solving problems in similar ways as "smarter" architectures do (but not sure about this), but you would need way more parameters and way more training data to achieve similar results, much of which would be wasted on "low quality parameters" that could just as well be omitted
so, encoding these modality priors into your network architecture spares you probably orders of magnitude of compute compared to naive approaches
while the bitter lesson makes sense, it maybe under-emphasizes the degree to which choosing suitable network architecture + high quality training data matters?
lastly, the question "which problem you're trying to solve" cannot just be answered on a high level with "I want to minimize loss in next-token prediction", but the exact problem the network solves depends strongly on the training data; loss minimization is a trade-off between all the things you're minimizing, so the higher the amount of rambling, gossip, meaningless binary data and so on in your training data is, the more parameters and training time you'll need just for those, and the less capable the network will be of predicting more meaningful tokens.
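
To make the "orders of magnitude" point in the list above a bit more concrete, here is a rough parameter count for a single layer, comparing a fully connected layer with a convolutional one. The image size (224x224 RGB), the 64 output channels and the 3x3 kernel are arbitrary illustrative choices, not figures taken from any particular model.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv2d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution. The count does not depend on
    the image size, because the same kernel is reused at every position."""
    return kernel * kernel * c_in * c_out + c_out


height, width, c_in, c_out = 224, 224, 3, 64  # assumed sizes

# Dense layer connecting every input value to every value of a
# same-resolution output feature map with c_out channels.
dense = dense_params(height * width * c_in, height * width * c_out)
conv = conv2d_params(c_in, c_out, kernel=3)

print(f"dense: {dense:,} parameters")
print(f"conv:  {conv:,} parameters")
print(f"ratio: {dense / conv:,.0f}x")
```

Under these assumptions the dense layer needs hundreds of billions of parameters while the convolution needs fewer than two thousand. The locality prior is what removes almost all of them.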

Related to that last point, I recently worked on a small project where you, as the user, play Pong against an AI. That AI is controlled by a small neural network (something on the order of 2 or 3 hidden layers and a few dozen neurons), initialized randomly, so at first it's very easy for the human to win. While you play, though, the game collects your behavior as training data and constantly trains the neural network, which eventually learns to mirror you. So after a few minutes of playing, it plays very similarly to the human and it becomes much harder to beat it.

One thing I noticed while working on this is that the naive approach to training this AI was far from optimal: much of the training data I collected ended up being pretty irrelevant for playing well! E.g., it's much more important how the paddle moves while the ball is closing in, and almost entirely irrelevant what you do right after hitting the ball. There were several such small insights, leading me to tweak how exactly training data is collected (e.g. sampling it with lower probability while the ball is moving away than when it's getting closer), which greatly reduced the time it took for the AI to learn, even with the network architecture staying the same.
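
A minimal sketch of the sampling tweak described above, not the project's actual code: each frame is kept as a training example with a probability that depends on whether the ball is approaching the paddle whose behavior is being recorded. The `Frame` fields, the keep probabilities (1.0 and 0.1) and the sign convention for `ball_vx` are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Frame:
    """One observed game state plus the human's paddle action (hypothetical layout)."""
    ball_x: float
    ball_y: float
    ball_vx: float  # assumed: positive means the ball moves toward the recorded paddle
    ball_vy: float
    paddle_y: float
    action: int     # -1 = up, 0 = stay, 1 = down


# Assumed keep probabilities; the real values are not stated in the post.
KEEP_APPROACHING = 1.0
KEEP_RECEDING = 0.1


def keep_probability(frame: Frame) -> float:
    """Frames where the ball is closing in are kept more often than the rest."""
    return KEEP_APPROACHING if frame.ball_vx > 0 else KEEP_RECEDING


def collect(frames, rng: random.Random):
    """Subsample frames so the training set favors the informative ones."""
    return [f for f in frames if rng.random() < keep_probability(f)]
```

With these numbers, roughly one in ten "ball moving away" frames would end up in the training set, while every "ball closing in" frame is kept.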

Notably, this does not necessarily mean the loss curve dropped more quickly - due to me tweaking the training data, the loss curves before and after doing so related to quite different things. The same loss for higher quality data is much more useful than for noisy or irrelevant data.

There are just so many degrees of freedom in all of this that it seems very likely that, even if there were no hardware advances at all, research would be able to come up with faster/cheaper/better-performing models for a long time.

[-]gwern1y*90

for text, you might realize that different parts of the text refer to each other, so need a way to effectively pass information around, and hence you end up with something like the attention mechanism

If you are trying to convince yourself that a Transformer could work and to make it 'obvious' to yourself that you can model sequences usefully that way, it might be a better starting point to begin with Bengio's simple 2003 LM and MLP-Mixer. Then Transformers may just look like a fancier MLP which happens to implement a complicated way of doing token-mixing inspired by RNNs and heavily tweaked empirically to eke out a bit more performance with various add-ons and doodads.

(AFAIK, no one has written a "You Could Have Invented Transformers", going from n-grams to Bengio's LM to MLP-Mixer to RNN to Set Transformer to Vaswani Transformer to a contemporary Transformer, but I think it is doable and useful.)

[-]quetzal_rainbow1y64

I think you would appreciate this post

[-]silentbob2y20

For people who like guided meditations: there's a small YouTube channel providing a bunch of secular AI-generated guided meditations of various lengths and topics. More are to come, and the creator (whom I know) is happy about suggestions. Three examples:

They are also available in podcast form here.

I wouldn't say these meditations are necessarily better or worse than any others, but they're free and provide some variety. Personally, I avoid apps like Waking Up and Headspace due to both their imho outrageous pricing model and their surprising degree of monotony. Insight Timer is a good alternative, but the quality varies a lot and I keep running into overly spiritual content there. Plus there's obviously thousands and thousands of guided meditations on YouTube, but there too it's hit and miss. So personally I'm happy about this extra source of a good-enough-for-me standard.

Also, in case you ever wanted to hear a guided meditation on any particular subject or in any particular style, I guess you can contact the YouTube channel directly, or tell me and I'll forward your request.
