magfrump

Mathematician turned software engineer. I like swords and book clubs.

40

I am confused about the opening of your analysis:

In some sense, this idea solves basically none of the core problems of alignment. We still need a good-enough model of a human and a good-enough pointer to human values.

It seems to me that while the fixed point conception here doesn't uniquely determine a learning strategy, it should be possible to uniquely determine that strategy by building it into the training data.

In particular, if you have a base level of "reality" like the P_0 you describe, then it should be possible to train a model first on this reality, then present it with training scenarios that start by working directly on the "verifiable reality" subset, then build to "one layer removed" and so on.

My (very weak) shoulder-John says that just because this "feels like it converges" doesn't actually make any guarantees about convergence, but since P_0, P_1, etc. are very well specified it feels like a more approachable problem to try to analyze a specific basis of convergence. If one gets a basis of convergence, AND an algorithm for locating that basis of convergence, that seems to me sufficient for object-level honesty, which would be a major result.

I'm curious if you disagree with:

- The problem of choosing a basis of convergence is tractable (relative to alignment research in general)
- The problem of verifying that AI is in the basis of convergence is tractable
- Training an AI into a chosen basis of convergence could force that AI to be honest on the object level when object level honesty is available
- Object level honesty is not a major result, for example because not enough important problems can be reduced to object level or because it is already achievable

Writing that out, I am guessing that 2 may be a disagreement that I still disagree with (e.g. you may think it is not tractable), and 3 may contain a disagreement that is compelling and hard to resolve (e.g. you may think we cannot verify which basis of convergence satisfies our honesty criteria--my intuition is that this would require not having a basis of convergence at all).

20

My issue isn't with the complexity of a Turing machine, it's with the term "accessible." Universal search may execute every Turing machine, but it also adds more than exponential time complexity to do so.

In particular, if there are infinitely many Schelling points in the manipulation universe to be manipulated and referenced, then all of that computation has to causally precede the simplest such Schelling point for any answer that needs to be manipulated!

It's not clear to me what it actually means for there to exist a Schelling point in the manipulation universe that would be used by Solomonoff Induction to get an answer, but my confusion isn't about (arbitrarily powerful computer) or (Schelling point) on their own, it's about how much computation you can do before each Schelling point, while still maintaining the minimality criteria for induction to be manipulated.

2-1

I'm confused by your intuition that team manipulation's universe has similar complexity to ours.

My prior is that scaling the size of (accessible) things in a universe also requires scaling the complexity of the universe in an unbounded way, probably even a super-linear way. Fully specifying "infinite computing power", or more concretely "sufficient computing power to simulate universes of complexity <=X for time horizons <=Y", then requires complexity f(X,Y) which is unbounded in X and Y. That falls apart completely as a practical solution (since our universe is at age 10^62 Planck intervals) unless f(X,Y) is ~O(log(Y)), whereas using a pure counting method (e.g. the description simply counts how many universe states can be simulated) gives O(exp(Y)).

Since my intuition gives the complexity of Team Manipulation's raw universe at >10^(10^62), I'm curious what your intuition is that makes it clearly less than that of Team Science. There are approximately 10^185 Planck volumes in our observable universe so it takes only a few hundred bits to specify a specific instance of something inside a universe, plus a hundred or so to specify the Planck timestamp. In particular, this suggests that the third branch of Team Science is pretty small relative to the 10^8 specification of an observer architecture, not overwhelmingly larger.
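As a rough check on the bit counts above, here is a minimal sketch that converts the two figures quoted in this comment (10^185 Planck volumes, 10^62 Planck intervals) into the number of bits needed to index one of them. The helper name `bits_to_index` is made up for illustration, and the calculation assumes a plain uniform index with no cleverer encoding.

```python
import math

def bits_to_index(log10_count: float) -> float:
    """Bits needed to pick one item out of 10**log10_count equally likely items."""
    return log10_count * math.log2(10)

# Figures taken from the comment above.
location_bits = bits_to_index(185)   # one Planck volume in the observable universe
timestamp_bits = bits_to_index(62)   # one Planck interval in the universe's age so far

print(f"location:  ~{location_bits:.0f} bits")
print(f"timestamp: ~{timestamp_bits:.0f} bits")
print(f"total:     ~{location_bits + timestamp_bits:.0f} bits")
```

Under these assumptions the location costs about 615 bits and the timestamp about 206 bits, so the total stays in the high hundreds of bits, far below a 10^8 specification.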

250

Many people who commute to work at businesses in San Francisco don't live there. I would expect GDP per capita to be misleading in such cases for some purposes.

Broadening to the San Francisco-San Jose area, there are 9,714,023 people with a GDP of $1,101,153,397,000/year, giving a GDP/capita estimate of $113,357. I know enough people who commute between Sunnyvale and San Francisco or even further that I'd expect this to be 'more accurate' in some sense, though obviously it's only slightly lower than your first figure and still absurdly high.
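For anyone who wants to rerun the division, a minimal sketch using only the two figures quoted above (the variable names are illustrative):

```python
# Figures as quoted in the comment for the San Francisco-San Jose area.
population = 9_714_023
gdp_usd_per_year = 1_101_153_397_000

gdp_per_capita = gdp_usd_per_year / population
print(f"GDP per capita: ${gdp_per_capita:,.0f}")  # prints "GDP per capita: $113,357"
```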

But the city of San Francisco likely has a much smaller tax base than its putative GDP/capita would suggest, so provision of city based public services may be more difficult to manage.

130

**tl;dr: if models unpredictably undergo rapid logistic improvement, we should expect zero correlation in aggregate.**

**If models unpredictably undergo SLOW logistic improvement, we should expect positive correlation. This also means getting more fine-grained data should give different correlations.**

To condense and steelman the original comment slightly:

Imagine that learning curves all look like logistic curves. The following points are unpredictable:

- How big of a model is necessary to enter the upward slope.
- How big of a model is necessary to reach the plateau.
- How good of performance the plateau gives.

Would this result in zero correlation between model jumps?

So each model is in one of the following states:

- State 1: floundering randomly
- State 2: learning fast
- State 3: at performance plateau

Then the possible transitions (small -> 7B -> 280B) are:

1->1->1 : slight negative correlation due to regression to the mean

1->1->2: zero correlation since first change is random, second is always positive

1->1->3: zero correlation as above

1->2->2: positive correlation as the model is improving during both transitions

1->2->3: positive correlation as the model improves during both transitions

1->3->3: zero correlation, as the model is improving in the first transition and random in the second

2->2->2: positive correlation

2->2->3: positive correlation

2->3->3: zero correlation

3->3->3: slight negative correlation due to regression to the mean

That's two cases of slight negative correlation, four cases of zero correlation, and four cases of positive correlation.

However positive correlation only happens if the middle state is state 2, so only if the 7B model does meaningfully better than the small model, AND is not already saturated.

If the logistic jump is slow (takes >3 OOM) **AND** we are able to reach it with the 7B model for many tasks, then we would expect to see positive correlation.

However if we assume that

- Size of model necessary to enter the upward slope is unpredictable
- Size of a model able to saturate performance is rarely >100x models that start to learn
- The saturated performance level is unpredictable

Then we will rarely see a 2->2 transition, which means the actual possibilities are:

Two cases of slight negative correlation

Four cases of zero correlation

One case of positive correlation (1->2->3, which should be less common as it requires 'hitting the target' of state 2)

Which should average out to around zero or very small positive correlation, as observed.
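The case counting above can be double-checked mechanically. The sketch below is only an illustration of the argument in this comment, not a simulation of real learning curves: it assumes models never move back to an earlier state, and it encodes the correlation labels exactly as the list above states them. The names `classify` and `tally` are made up here.

```python
from itertools import combinations_with_replacement

# States as numbered in the comment:
# 1 = floundering randomly, 2 = learning fast, 3 = at performance plateau.
STATES = (1, 2, 3)

def classify(path):
    """Label a (small, 7B, 280B) state path with the correlation the comment assigns it."""
    small, mid, large = path
    if small == mid == large and mid != 2:
        return "slight negative"  # both jumps are noise: regression to the mean
    if mid == 2:
        return "positive"         # the model is improving across both jumps
    return "zero"                 # one jump is noise, the other is improvement

def tally(paths):
    """Count how many paths fall under each correlation label."""
    counts = {"slight negative": 0, "zero": 0, "positive": 0}
    for path in paths:
        counts[classify(path)] += 1
    return counts

# Assumption: states never decrease, so paths are non-decreasing triples.
all_paths = list(combinations_with_replacement(STATES, 3))

# "Rarely a 2->2 transition": drop every path containing a 2->2 step.
fast_paths = [p for p in all_paths if (2, 2) not in (p[:2], p[1:])]

print(tally(all_paths))   # all ten transition cases
print(tally(fast_paths))  # cases left when the logistic jump is fast
```

This only counts cases. How often each case occurs in practice is a separate question that the enumeration does not answer.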

However, more precise data with smaller model size differences would be able to find patterns much more effectively, as you could establish which of the transition cases you were in.

Even so, this model still leaves progress basically "unpredictable" if you aren't actively involved in the model production, since if you only see the public updates you don't have the more precise data that could find the correlations.

This seems like evidence for 'fast takeoff' style arguments--since we observe zero correlation, if the logistic form holds, that suggests that **ability to do a task at all** is very near in cost to **ability to do a task as well as possible**.

20

Seconded. AI is good at approximate answers, and bad at failing gracefully. This makes it very hard to apply to some problems, or requires specialized knowledge/implementation that there isn't enough expertise or time for.

Based on my own experience and the experience of others I know, I think knowledge starts to become taut rather quickly - I’d say at an annual income level in the low hundred thousands.

I really appreciate this specific calling out of the audience for this post. It may be limiting, but it is also likely limiting to an audience with a strong overlap with LW readership.

Everything money can buy is “cheap”, because money is "cheap".

I feel like there's a catch-22 here, in that there are many problems that probably could be solved with money, but I don't know how to solve them with money--at least not efficiently. As a very mundane example, I know I could reduce my chance of ankle injury during sports by spending more money on shoes. But I don't know which shoes will actually be cost-efficient for this, and the last time I bought shoes I stopped using two different pairs after just a couple months.

Unfortunately I think that's too broad of a topic to cover and I'm digressing.

Overall, coming back to this, I'm realizing that I don't actually have any way to act on this piece. Even though I am in the intended audience, and I have been making a specific effort in my life to treat money as cheap and plentiful, I am not seeing:

- Advice on which subjects are likely to pay dividends, or why
- Advice on how to recover larger amounts of time or effort by spending money more efficiently
- Discussion of when those tradeoffs would be useful

This seems especially silly not to have given, for example, Zvi's Covid posts, which are a pretty clear modern day example of the Louis XV smallpox problem.

I would be interested in seeing someone work through how it is that people on LW ended up trusting Zvi's posts and how that knowledge was built. But I would expect that to turn into social group dynamics and analysis of scientific reasoning, and I'm not sure that I see where the idea of money's abundance would even come into it.

I think this post does a good job of focusing on a stumbling block that many people encounter when trying to do something difficult. Since the stumbling block is about *explicitly causing yourself pain*, to the extent that this is a common problem and that the post can help avoid it, that's a very high return prospect.

I appreciate the list of quotes and anecdotes early in the post; it's hard for me to imagine what sort of empirical references someone could make to verify whether or not this is a problem. Well known quotes and a long list of anecdotes is a substitute, though not a perfect substitute.

That said, the "Antidotes" section could easily contain some citations. For example:

If your wrists ache on the bench press, you're probably using bad form and/or too much weight. If your feet ache from running, you might need sneakers with better arch support. If you're consistently sore for days after exercising, you should learn to stretch properly and check your nutrition.

Such rules are well-established in the setting of physical exercise[...]

There are 4 claims being made here, but if the rules really are well established, shouldn't it be easy to find citations for them?

I don't doubt those claims, but the following claims:

If reading a math paper is actively unpleasant, you should find a better-written paper or learn some background material first (most likely both). If you study or work late into the night and it disrupts your Circadian rhythm, you're trading off long-term productivity and well-being for low-quality work.

I'm more skeptical of. In many cases there is only one definitive paper on a subject in math research. Often it's a poorly written paper, but there may not be a better writeup of the results (at least for modern research results). Studying late into the night could disrupt one person's Circadian rhythm, but it could be a way for someone else to actually access their productive hours, instead of wasting effort waking up early in the morning.

These aren't criticisms of the core point of the post, but they are places where the focus on examples without citations, I think, moves away from the core point and could be taken out of context.

The comments outline a number of issues with some of the framing and antidote points, and I think the post would be better served by making a clearer line about the distinction between "measuring pain is not a good way to measure effort" and "painful actions can be importantly instrumental."

I can imagine an experiment in which two teams are asked to accomplish a task and asked to focus on remembering either "no pain no gain" or "pain is not the unit of effort" and consider what happens to their results, but whether one piece of advice is better on the margin seems likely to be very personal, and I don't know that I'd expect to get very interesting results from such an experiment.

20

Your model of supporters of farm animal welfare seems super wrong to me.

I would predict that actually supporters of the law will be more unhappy the more effect it has on the actual market, because that reveals info about how bad conditions are for farm animals. In particular if it means shifting pork distribution elsewhere, that means less reduction in pig torture and also fewer options to shift consumption patterns toward more humanely raised meat on the margins.

Those costs can be worth paying, if you still expect some reduction in pig torture, but obviously writing laws to be better defined and easier to measure would be a further improvement.

Not from OpenAI but the language sounds like this could be the board protecting themselves against securities fraud committed by Altman.