LessWrong is coming under increasing pressure; I hope it does not turn into a journal. I wish the admins good luck.
Now that the value of OpenAI minus the nonprofit’s share has tripled to $500 billion, that is even more true. We are far closer to the end of the waterfall. The nonprofit’s net present value expected share of future profits has risen quite a lot. They must be compensated accordingly, as well as for the reduction in their control rights, and the attorneys general must ensure this.
I think this reasoning is flawed, but my understanding of economics is pretty limited, so take my opinion with a grain of salt.
I think it's flawed in that investors may have priced in the fact that the fancy nonprofit, the AGI dream, & whatnot were mostly a dumb show. So $500 billion is closer to the full value of OpenAI, rather than to the value left over outside the nonprofit's share under the current setup interpreted to the letter.
Comment to myself after a few months:
To make this the case, the new paradigm should, when properly studied and optimized, lead to more efficient AI systems than DL, below the threshold where DL stops scaling: it should make it possible to reach the same level of performance with less compute. For example, imagine the new paradigm used statistical models with a training procedure close to kosher Bayesian inference, and thus had a near-guarantee of squeezing all the information out of the training data (within the capped intelligence of the model).
I now think that LLM pre-training is probably already pretty statistically efficient, so nope, can't do substantially better through the route in this specific example.
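To make the "kosher Bayesian inference" example concrete, here is a minimal toy sketch (my own illustration, not anything from the post): a conjugate Beta-Bernoulli model, where the posterior depends on the sample only through a sufficient statistic, so no information in the training data is wasted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training data": coin flips with unknown bias p.
true_p = 0.3
data = rng.random(1000) < true_p

# Conjugate Bayesian update: Beta(a, b) prior -> Beta(a + heads, b + tails) posterior.
# The posterior depends on the data only through (heads, tails), a sufficient
# statistic, so in this model nothing in the sample is thrown away.
a0, b0 = 1.0, 1.0  # uniform prior
heads, tails = data.sum(), (~data).sum()
a_post, b_post = a0 + heads, b0 + tails

posterior_mean = a_post / (a_post + b_post)
posterior_sd = np.sqrt(a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))
print(f"posterior: {posterior_mean:.3f} +- {posterior_sd:.3f} (true p = {true_p})")
```

The hypothetical was a paradigm that behaves like this at scale; the update above only stays this clean because the model is trivially small.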
One general way my mental model of how statistics works disagrees with what you write here is over whether the specific properties required of estimators in different contexts (calibration, unbiasedness, minimum variance, etc.) are the things we actually want. I think of them as proxies, and I think Goodhart's law applies: when you try to get the best estimator in one of these senses, you "pull the cover" and break some other property that you would actually care about on reflection but are not aware of.
(Not answering many points in your comment to cut it short, I prioritized this one.)
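As a toy illustration of the proxy point (my own example, not something from your comment): for normal data, the unbiased variance estimator (divide by n-1) has strictly larger mean squared error than the biased ones that divide by n or n+1, so insisting on unbiasedness quietly trades away a property you might care about more.

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, true_var = 10, 200_000, 1.0
x = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

# Sum of squared deviations from the sample mean, one value per trial.
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

estimators = {
    "unbiased (n-1)": ss / (n - 1),  # zero bias, but not minimum MSE
    "MLE (n)":        ss / n,        # biased downward, smaller MSE
    "min-MSE (n+1)":  ss / (n + 1),  # most biased of the three, smallest MSE
}
for name, est in estimators.items():
    bias = est.mean() - true_var
    mse = ((est - true_var) ** 2).mean()
    print(f"{name:>15}: bias {bias:+.3f}, MSE {mse:.4f}")
```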
Bayesian here. I'll lay down my thoughts after reading this post in no particular order. I'm not trying to construct a coherent argument for or against the post's thesis, not due to lack of interest but due to lack of time, though it will be evident that I'm generally pro-Bayes:
In the last few years, as I've read the US-vs.-China AI discussion in the blogosphere, I've always felt confused by this kind of question (export to China or not? China this or that?). I really don't have an intuition for what the right answer is here. I've never thought about this deeply, so I'll take the occasion to write down some thoughts.
Condition on the scenario where dangerous AI/the point of no return arrives in 2035 with AI development remaining unrestricted, i.e., not a scenario where it would have arrived earlier but was regulated away:
Consider the question Q = "Is China at the cutting edge with chips in 2035?". I'll compare three policies and write down P(Q|do(Policy)):
- P(Q|free chips trade with China) = 30%
- P(Q|restrictions on exports to China of the most powerful chips) = 50%
- P(Q|block all chip exports to China) = 80%
I totally made up these percentages; I guess my brain simply generated three ~evenly-spaced numbers in (0, 100).
Then the next question would be: what difference does Q make? Does it matter whether China is at the same level as the US?
The US is totally able to create the problem from scratch in a unipolar world. Would an actually multipolar world be even worse? Or would it make no difference, because the US is racing against itself anyway? Or would it have the opposite effect, forcing the US to actually sit at the table?
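Just to make the structure of that last question explicit, here's a back-of-the-envelope sketch using the made-up probabilities from above plus equally made-up placeholder utilities: the expected value of each chips policy is a weighted average over whether Q holds, so the policies only separate to the extent that U(Q) and U(not Q) differ.

```python
# P(Q | do(policy)) from the made-up numbers above, plus hypothetical placeholder
# utilities for Q = "China at the cutting edge with chips in 2035" vs. not-Q.
p_q = {
    "free chips trade":       0.30,
    "restrict top chips":     0.50,
    "block all chip exports": 0.80,
}
u_q, u_not_q = -1.0, 0.0  # placeholders; the real question is what these should be

for policy, p in p_q.items():
    expected_utility = p * u_q + (1 - p) * u_not_q
    print(f"{policy:>22}: E[U] = {expected_utility:+.2f}")
```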
I'm jumping in to reply having read the post in the past, without re-reading it or the discussion, so maybe I'll be redundant. With that said:
I think that nonparametric probabilistic programming will in general have the same order of magnitude of parameters as DL (a toy sketch of why is below, after the list). The number of parameters can be substantially lower only when you have a neat simplified model, either because
1. you understand the process well enough to constrain it heavily, for example in physics experiments, or
2. the data are so noisy that it's hopeless to extract more information anyway, so a simple model suffices.
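Here is the toy sketch promised above (standard Gaussian-process regression in plain numpy, nothing specific to the post): the "parameters" of a nonparametric model end up being roughly one coefficient per training point, so model size scales with the data instead of staying small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D regression data.
n = 200
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.1 * rng.normal(size=n)

# Squared-exponential kernel.
def kernel(a, b, lengthscale=0.5, variance=1.0):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

noise_var = 0.1 ** 2
K = kernel(x, x) + noise_var * np.eye(n)

# The GP posterior mean at new points is kernel(x_new, x) @ alpha: one
# coefficient per training point, so the effective parameter count grows
# linearly with the dataset.
alpha = np.linalg.solve(K, y)
x_new = np.linspace(-3, 3, 5)
prediction = kernel(x_new, x) @ alpha

print("effective number of coefficients:", alpha.size)  # == n
print("predictions:", np.round(prediction, 3))
```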
Last time I was in the US, I often relied on the following pasta recipe to prepare multiple meals at once:
- fettuccine
- cook bacon in a pot with just a drizzle of olive oil (no other additions)
- cook pre-sliced mushrooms in another pot (no other additions)
- throw everything into the drained fettuccine in a pot
- add a few cheese singles and mix everything until they are melted and blended
The thing about this recipe is that I've observed it keeps its original flavor when stored in the fridge and re-heated in the microwave; I've kept it in the fridge for up to 10 days. Many other pasta sauces change flavor for the worse when handled that way, and freezing was even worse in my experience.
Yeah I had complaints when I was taught that formula as well!
I was referring to the fact that you set LessWrong posts hitting karma thresholds as target metrics. This kind of thing generally has the negative side effect of incentivizing exploitation of loopholes in the LessWrong moderation protocol, karma system, and community norms to increase the karma of one's own posts. See Goodhart's law.
I do not think this is currently a problem; my superficial impression of your experiment is that it is good. However, this kind of thing could become a problem down the line if it becomes more common. It would show up as a mix of lower forum quality and increased moderation work.