Short Remark on the (subjective) mathematical 'naturalness' of the Nanda–Lieberum addition modulo 113 algorithm
These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.

In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking', Nanda and Lieberum write:

> This algorithm operates via using trig identities and Discrete Fourier Transforms to map $x, y \to \cos(w(x+y)), \sin(w(x+y))$, and then extracting $x + y \pmod{p}$

And:

> The model is trained to map $x, y$ to $z \equiv x + y \pmod{113}$ (henceforth 113 is referred to as $p$)

But the casual reader should use caution! It is in fact the case that "Inputs $x, y$ are given as one-hot encoded vectors in $\mathbb{R}^p$". This point is of course emphasized more in the full notebook (it has to be; that's where the code is), and the arXiv paper that followed is also much clearer about it. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given $x$ and $y$ as inputs. It is not trained to 'act' on the numbers $x$ and $y$ themselves.

When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number $x$ be represented by its own magnitude. You instead have a situation in which $x$ and $y$ now really live 'in the domain' (it's almost like a dual point of view: the number $x$ is not the size of the signal, but the position at which the input signal is non-zero). So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the indicator functions of the points $x$ and $y$, rather than the numbers $x$ and $y$ themselves.
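To make this concrete, here is a minimal NumPy sketch (mine, not code from the original post, notebook, or paper) of the algorithm as described in the quoted takeaways: the inputs are one-hot vectors in $\mathbb{R}^p$, a discrete Fourier transform of those indicator vectors yields $\cos(wx), \sin(wx)$ for a given frequency $w$, and trig identities plus an argmax recover $x + y \pmod p$. The particular frequencies below are arbitrary placeholders; the trained network learns its own small set and realises these transforms only approximately, through its embedding and unembedding weights, rather than by literally calling an FFT.

```python
import numpy as np

p = 113
freqs = [14, 35, 41, 52]  # illustrative placeholder frequencies, not the model's learned ones

def one_hot(n: int, p: int = p) -> np.ndarray:
    """Represent n as the indicator vector delta_n in R^p (the 'dual' view: n is a position, not a magnitude)."""
    v = np.zeros(p)
    v[n] = 1.0
    return v

def add_mod_p(x: int, y: int) -> int:
    """Recover x + y (mod p) from one-hot inputs via DFT components and trig identities."""
    ks = np.arange(p)
    logits = np.zeros(p)
    for w in freqs:
        # The DFT of the indicator delta_x at frequency w is exactly
        # cos(2*pi*w*x/p) - i*sin(2*pi*w*x/p); read off its real and imaginary parts.
        fx = np.fft.fft(one_hot(x))[w]
        fy = np.fft.fft(one_hot(y))[w]
        cos_wx, sin_wx = fx.real, -fx.imag
        cos_wy, sin_wy = fy.real, -fy.imag
        # Trig identities combine the single-input terms into cos(w(x+y)) and sin(w(x+y)).
        cos_wxy = cos_wx * cos_wy - sin_wx * sin_wy
        sin_wxy = sin_wx * cos_wy + cos_wx * sin_wy
        # The logit for each candidate z is cos(w(x+y-z)), which peaks at z = x + y (mod p).
        logits += cos_wxy * np.cos(2 * np.pi * w * ks / p) + sin_wxy * np.sin(2 * np.pi * w * ks / p)
    return int(np.argmax(logits))

assert add_mod_p(57, 98) == (57 + 98) % p
```

The point about naturalness is visible in the very first step of the sketch: once $x$ is the indicator vector $\delta_x$, reading off a Fourier component, i.e. a phase $e^{-2\pi i w x / p}$, is arguably an obvious thing to do, in a way that it would not be if the input were the scalar $x$ itself.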
I for one would find it helpful if you included a link to at least one place where Eliezer has made this claim, just so we can be sure we're on the same page.
Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because, by the time we have a very powerful AI that can solve the alignment homework, it is already too dangerous to use the fact that it can solve the homework as a safety plan. The other is the claim that there is some sort of 'intrinsic' reason why an AI built by humans could never solve the alignment homework.