I'm not sure what my exact thoughts were back then. I was, and still am, at least skeptical of the specific formula used, as it seems arbitrary. It is intentionally designed to have certain properties, like power-law diminishing returns. So it's not exactly a "wild implication" that it has these properties.

I recently fit the Chinchilla formula to the data from the first LLaMA paper: https://i.imgur.com/u1Tm5EU.png

This was over an unrelated disagreement elsewhere about whether Chinchilla's predictions still held or made sense, as well as the plausibility of training tiny models to far greater performance.

First, the new parameters are wildly different from the old ones. Take that for what you will, but they are hardly set in stone. Second, even with the best fit, the formula still doesn't really match the shape of the observed curves. I think it's just not the right curve.

As for reusing data, I've seen sources claim that reusing data in language models up to four times had no negative effect, and that up to around 40 times was possible before it really stopped helping. I think LLMs currently do not use much of the regularization and other tricks that were used in other fields when data was limited. Those might push it further.
If data became truly scarce, there may be other tricks to extend the data we have further. You also have all of the data from the people that talk to these things all day and upvote and downvote their responses. (I don't think anyone has even tried making an AI that intentionally asks users questions about things it wants to learn more about, like a human would do.)

Human beings cannot do most math without pencil and paper and a lot of pondering, whereas a number of papers show that specialized transformers can do math and code at a more sophisticated level than I would have expected before seeing the results.

The Pile includes 7GB of math problems generated by DeepMind, basically as you describe. I don't believe the models trained on it can do any of them, but my testing wasn't properly done.

They fit a simplistic model where the two variables are independent and the contribution of each decays as a power law. This leads to the shocking conclusion that the two inputs are independent and decay as a power law...
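
For reference, the parametric form in the Chinchilla paper (Hoffmann et al., 2022) can be written as below, with N the parameter count, D the number of training tokens, and E, A, B, alpha, beta the fitted constants. The second expression is just the partial derivative worked out from the first: it contains no N at all, which is the built-in independence, and each term falls off as a power law in its own variable.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
\frac{\partial L}{\partial D} = -\frac{\beta B}{D^{\beta + 1}}
```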

I mean the model is probably fine for its intended purpose: finding the rough optimal ratio of parameters and data for a given budget. It might mean that current models have suboptimal compute budgets. But it doesn't imply anything beyond that, like some hard limit to scaling given our data supply.
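
As a rough sketch of that intended use, here is how the formula turns a compute budget into a parameter/data split. The constants are the fit originally published in the Chinchilla paper (not the LLaMA refit mentioned elsewhere on this page), the budget rule FLOPs ≈ 6·N·D is the usual approximation and is an assumption here, and the function names are made up for illustration.

```python
# Published Chinchilla fit (Hoffmann et al., 2022); other fits will differ a lot.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Predicted loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal(flops):
    """Split a budget between parameters and tokens, assuming flops ~= 6 * N * D.

    Closed form from setting dL/dN = 0 along the constraint N * D = flops / 6.
    """
    budget = flops / 6
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_opt = g * budget ** (BETA / (ALPHA + BETA))
    d_opt = budget / n_opt
    return n_opt, d_opt


if __name__ == "__main__":
    n_opt, d_opt = compute_optimal(1e21)
    print(f"params: {n_opt:.3e}, tokens: {d_opt:.3e}, loss: {loss(n_opt, d_opt):.4f}")
```

Nothing in this calculation says what happens off the fitted range, which is the point: it answers "what ratio for this budget", not "where does scaling stop".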

If the big tech companies really want to train a giant model, but run out of data (unlikely)... well it may not be compute optimal, but there is nothing stopping them from doing multiple passes over the same data. If they even get to the point that it starts to overfit (unlikely), there's a plethora of regularization methods to try.

The temporal difference learning algorithm is an efficient way to do reinforcement learning, and something like it probably happens in the human brain. If you are playing a game like chess, it may take a long time to get enough examples of wins and losses to train an algorithm to predict good moves. Say you play 128 games: at one win/loss bit per game, that's at most 128 bits of information, which is next to nothing. You have no way of knowing which moves in a game were good and which were bad, so you have to assume all moves made during a losing game were bad, which throws out a lot of information.

Temporal difference learning can learn "capturing pieces is good" and start optimizing for that instead. This implies that "inner alignment failure" is a constant fact of life. There are probably players that get quite far in chess doing nothing more than optimizing for piece capture.
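
To make the contrast with win/loss-only learning concrete, here is a minimal sketch of tabular TD(0) on a toy five-state random walk. This is a standard textbook setup, not chess, and the function name and parameters are made up for illustration. Falling off the right edge counts as a win, the left edge as a loss. The key line is the update: each state's value is nudged toward the estimate of the next state, so the learner gets a training signal on every move instead of one signal per game.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.01, seed=0):
    """Estimate each state's chance of winning with TD(0)."""
    rng = random.Random(seed)
    n_states = 5
    values = [0.0] * n_states  # V(s): predicted chance of eventually winning
    for _ in range(episodes):
        state = n_states // 2  # every episode starts in the middle
        while True:
            nxt = state + rng.choice((-1, 1))
            if nxt < 0:  # fell off the left edge: loss
                target = 0.0
            elif nxt >= n_states:  # fell off the right edge: win
                target = 1.0
            else:  # game continues: bootstrap from the next state's estimate
                target = values[nxt]
            # TD(0) update: move V(state) a small step toward the target.
            values[state] += alpha * (target - values[state])
            if nxt < 0 or nxt >= n_states:
                break
            state = nxt
    return values


if __name__ == "__main__":
    learned = td0_random_walk()
    exact = [(i + 1) / 6 for i in range(5)]  # known true values for this walk
    for s, (v, t) in enumerate(zip(learned, exact)):
        print(f"state {s}: learned {v:.3f}, exact {t:.3f}")
```

The learned values are a proxy for winning, in the same way "capturing pieces is good" is: useful to optimize, but not the actual objective.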

I used to have anxiety about the many worlds hypothesis. It just seems kind of terrifying, constantly splitting into hell-worlds, along with the implications of quantum immortality. But it didn't take long for it to stop bothering me, and for me to even suppress thoughts about it. After all, such thoughts don't lead to a reward and cause problems, and an RL brain should punish them.

But that's kind of terrifying itself, isn't it? I underwent a drastic change to my utility function, and even developed anti-rational heuristics for suppressing thoughts, which a rational Bayesian should never do (at least not for these reasons).

Anyway, gwern has a whole essay on multi-level optimization algorithms like this that I haven't seen linked yet: https://www.gwern.net/Backstop

It's back btw. If it ever goes down again you can probably get it on the Wayback Machine. And yes, the /r/bad* subreddits are full of terrible academia snobbery. Badmathematics is the best of the bunch because mathematics is at least kind of objective. So they mostly talk about philosophy of mathematics.

The problem is that formal models of probability theory have problems with logical uncertainty. You can't assign a nonzero probability to a false logical statement. All the reasoning about probability theory is around modelling uncertainty in the unknown external world. This is an early attempt to think about logical uncertainty, which MIRI has now published papers on and tried to formalize.

Just calling them "log odds" is fine and they are widely used in real work.
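
A minimal sketch of why they are convenient, assuming nothing beyond the standard definitions (the function names are made up for illustration): in log odds, a Bayesian update is just adding the log of the likelihood ratio.

```python
import math


def log_odds(p):
    """Convert a probability in (0, 1) to log odds."""
    return math.log(p / (1 - p))


def probability(l):
    """Convert log odds back to a probability."""
    return 1 / (1 + math.exp(-l))


def update(prior_p, likelihood_ratio):
    """Bayesian update: add the log likelihood ratio to the prior log odds."""
    return probability(log_odds(prior_p) + math.log(likelihood_ratio))


if __name__ == "__main__":
    # Even prior odds, then evidence that is 3x likelier if the hypothesis is true.
    print(update(0.5, 3.0))
```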

Btw what does "Response to previous version" mean? Was this article significantly edited? It doesn't seem so confrontational reading it now.

That's unlikely. By the late 19th century there was no stopping the industrial revolution. Without coal maybe it would have slowed down a bit. But science was advancing at a rapid pace, and various other technologies from telephones to electricity were well on their way. It's hard for us to imagine a world without coal, since we took that path. But I don't see why it couldn't be done. There would probably be a lot more investment in hydro and wind power (both of which were a thing before the industrial revolution.) And eventually solar. Cars would be hard, but electric trains aren't inconceivable.

we have nuclear weapons that are likely visible if fired en masse.

Would we be able to detect nuclear weapons detonated light years away? We have trouble detecting detonations on our own planet! And even if we did observe them, how would we recognize it as an alien invasion vs local conflict, or god knows what else?

The time slice between us being able to observe the stars, and post singularity, is incredibly tiny. It's very unlikely two different worlds will overlap so that one world is able to see the other destroyed and rush a singularity. I'm not even sure if we would rush a singularity if we observed aliens, or if it would make any difference.

First of all, the Earth has been around for a very very long time. Even slowly expanding aliens should have hit us by now. The galaxy isn't that big relative to the vast amounts of time they have probably been around. I don't feel like this explains the Fermi paradox.

If aliens wanted to prevent us from fleeing, this is a terribly convoluted way of doing it. Just shoot a self replicating nanobot at us near the speed of light, and we would be dealt with. We would never see it coming. They could have done this thousands of years ago, if not millions. And it would be vastly more effective at snuffing out competition than this weird strategy. No need to even figure out which planets might evolve intelligent life. Just shoot all of them, it's cheap.

You could time them so they all hit their targets at the same time and give no warning. Or have them just do the minimal amount of destruction necessary so they aren't visible from space.

Well, we have plausible reason to believe in aliens: the Copernican principle, that the Earth isn't particularly special and the universe is enormous. There's literally no reason to believe angels and demons are plausible.

And god do I hate skeptics and how they pattern match everything "weird" to religion. Yes aliens are weird. That doesn't mean they have literally the same probability of existing as demons.
