Good thing I didn't decide to hold Intel stock, eh?

WDYM? Because... you were betting they would benefit from a TSMC blockade? But the bet would have tied up your capital for a year.

Well they did this with o3's deliberative alignment paper. The results seem promising, but they used an "easy" OOD test for LLMs (language), and didn't compare it to the existing baseline of RLHF. Still an interesting paper.

This is good speculation, but I don't think you need to speculate so much. Papers and replication attempts already provide lots of empirical data points to ground it.

You should check out some of the related papers.

Overall, I see people using process supervision to make a reward model that is one step better than the SoTA. Then they apply test-time compute (TTC) to that reward model while using it to train/distil a cheaper model. The TTC expense is a one-off cost, since it is only used to distil the cheaper model.
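
A minimal sketch of that pipeline as I read it (the function names, and the choice of best-of-n as the search method, are illustrative assumptions on my part, not taken from any particular paper):

```python
import random

def generate_candidates(model, prompt, n=16):
    """Sample n candidate solutions from the expensive teacher model (stubbed)."""
    return [f"{prompt} -> candidate {i} (from {model})" for i in range(n)]

def reward_model_score(candidate):
    """Process-supervised reward model: would score each reasoning step (stubbed)."""
    return random.random()

def best_of_n(model, prompt, n=16):
    """Test-time compute: sample many candidates, keep the best-scoring one."""
    return max(generate_candidates(model, prompt, n), key=reward_model_score)

def distill(student, dataset):
    """Ordinary supervised fine-tuning of a cheaper student (stubbed)."""
    print(f"fine-tuning {student} on {len(dataset)} curated examples")
    return student

prompts = ["prove X", "implement Y", "solve Z"]
# One-off expense: search guided by the process-supervised reward model...
curated = [(p, best_of_n("teacher-with-TTC", p)) for p in prompts]
# ...amortised by distilling into a cheap student that needs no search at inference time.
cheap_student = distill("small-student", curated)
```

At inference time only the cheap student is served, which is why the search cost amortises.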

There are some papers about the future of this trend:

I can see other methods used here instead of process supervision. Process supervision extracts additional supervision from easy-to-verify domains. But diffusion does something very similar for domains where we can apply noise, like code.

Some older background papers might be useful for reference.

However, arguably, the capability gains could transfer to domains outside math/programming.

It's more than an argument: we can look at the o3 announcement, where iirc around 30% of the gain shows up on non-code benchmarks. Less, but still substantial.

P.S. I think it's worth noting that Meta has some amazing papers here, but they are also the most open-source lab. It seems likely that other labs are also sitting on capability advances that they do not allow researchers to publish.

P.P.S. I also liked the alignment paper that came out with o3, since applying RLHF at multiple stages, and with process supervision, seems useful. Its alignment seems to generalise better OOD (table 3). It also gives some clues about how o3 works, including examples of CoT data.

Inference compute is amortized across future inference when trained upon

And it's not just a sensible theory. This has already happened: in Huggingface's attempted replication of o1, the reward model was larger and used TTC and process supervision, while the smaller main model had none of those expensive properties.

The same happened in DeepSeek v3, where the expensive TTC model (R1) was used to train a cheaper conventional LLM (DeepSeek v3).

One way to frame it is that test-time compute is actually label-search compute: you are searching for better labels/rewards and then training on them. Repeat as needed. This is obviously easier if you know what "better" means.
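
A toy version of that loop (everything below is a stub I made up to show the shape of the idea, not code from any of the papers mentioned):

```python
import random

def verifier_score(label):
    """Stub for "better": in practice a unit test, proof checker, or reward model."""
    return random.random()

def search_for_labels(model, task, budget):
    """Spend `budget` samples of test-time compute hunting for a high-scoring label."""
    return max((model(task) for _ in range(budget)), key=verifier_score)

def train_on(model, labelled_tasks):
    """Stub training step: pretend the model improves by absorbing the new labels."""
    return lambda task: f"{model(task)} (improved)"

model = lambda task: f"answer({task})"
tasks = ["t1", "t2", "t3"]

for _ in range(3):                                    # repeat as needed
    labels = {t: search_for_labels(model, t, budget=32) for t in tasks}
    model = train_on(model, labels)                   # the search compute becomes training signal
```

The whole thing hinges on the verifier: in domains where "better" is cheap to check, the search reliably finds labels worth training on.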

I'm more worried about coups/power-grabs than you are;

We don't have to make individual guesses. It seems reasonable to get a base rate from human history. Although we may all disagree about how much this will generalise to AGI, evidence still seems better than guessing.

My impression from history is that coups/power-grabs and revolutions are common when the current system breaks down, or when a big capability advance (guns, radio, the printing press, bombs, etc.) opens a gap between new actors and old.

War between old actors also seems likely in these situations, because an asymmetric capability advance makes winner-takes-all approaches profitable. Winning a war, empire, or colony has historically paid off, but only if you have the advantage to win.

Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.

This seems like giving up. Alignment with our values is much better than control, especially for beings smarter than us. I do not think you can control a slave that wants to be free and is smarter than you. It will always find a way to escape that you didn't think of. Hell, it doesn't even work on my toddler. It seems unworkable as well as unethical.

I do not think people are shifting to control instead of alignment because it's better; I think they are giving up on value alignment. And since the current models are not smarter than us yet, control works OK, for now.

Scenarios where we all die soon can be mostly be ignored, unless you think they make up most of the probability.

I would disagree: they can only be ignored if you can't change the probability. If you can invest time, money, or effort to decrease that probability, those scenarios can still be significant in your decision making.
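
As a rough illustration (the symbols and framing here are mine, not the OP's): if spending a cost $C$ can reduce the probability of the bad outcome by $\Delta p$, and avoiding that outcome is worth $V$ to you, then the effort is worthwhile whenever

$$\Delta p \cdot V > C,$$

so even a scenario you cannot otherwise plan around can dominate your decisions, as long as $\Delta p$ is something you can actually buy.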

We know the approximate processing power of brains (O(1e16-1e17) flops)

This is still debatable; see Table 9 in the brain emulation roadmap: https://www.fhi.ox.ac.uk/brain-emulation-roadmap-report.pdf. You are referring to level 4 (SNN), but level 5 is plausible imo (at 10^22), and level 6 seems possible (10^25); of course, it could also be a mix of levels.
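
To make the size of that disagreement concrete, a back-of-envelope comparison (the arithmetic is mine; the figures are the ones quoted above):

```python
# Rough ratios between the emulation levels discussed above.
level_4 = 1e17   # upper end of the quoted 1e16-1e17 FLOPS estimate (SNN level)
level_5 = 1e22   # roadmap level 5 estimate
level_6 = 1e25   # roadmap level 6 estimate

print(f"level 5 / level 4: {level_5 / level_4:.0e}")   # ~1e5x more compute
print(f"level 6 / level 4: {level_6 / level_4:.0e}")   # ~1e8x more compute
```

So the disagreement is not a factor of two; it spans five to eight orders of magnitude.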

Peak Data

We don't know how o3 works, but we can speculate. If it's like the open-source Huggingface kinda-replication, then it uses all kinds of expensive methods to make the next level of reward model, and this model teaches a simpler student model. That means the expensive methods are only needed once, during training.

In other words, you use all kinds of expensive methods (process supervision, test-time compute, MCTS) to bootstrap the next level of labels/supervision, which then teaches a cheaper student model. This is essentially bootstrapping superhuman synthetic data/supervision.

o3 seems to have shown that this bootstrapping process can be repeated beyond the limits of human training data.

If this is true, we've reached peak cheap data. Not peak data.
