
Comments

Razied · 17h

Basically, this shows that every term in a standard Bayesian inference, including the prior ratio, can be re-cast as a likelihood term in a setting where you start off unsure about what words mean, and have a flat prior over which set of words is true.

If the possible meanings of your words form a continuous one-dimensional variable x, a flat prior over x will not stay flat if you change variables to y = f(x) for an arbitrary bijection f, and the construction would be sneaking in a specific choice of function f.

Say the words are utterances about the probability of a coin landing heads: why should the flat prior be over the probability p rather than over the log-odds log(p/(1-p))?
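
A quick numerical sketch of that point (the Monte Carlo setup here is just for illustration): a prior that is flat over p is very much not flat over the log-odds.

```python
import numpy as np

rng = np.random.default_rng(0)

# A flat prior over the heads-probability p on (0, 1)...
p = rng.uniform(0.0, 1.0, size=1_000_000)

# ...re-expressed as a prior over the log-odds log(p / (1 - p)).
log_odds = np.log(p / (1.0 - p))

# Under the flat-in-p prior, most mass sits near log-odds 0 and the tails
# fall off quickly, so the prior is anything but flat in this
# parametrization: "flat" is a claim about a particular choice of variable.
hist, edges = np.histogram(log_odds, bins=np.arange(-6, 6.5, 0.5), density=True)
for lo, density in zip(edges[:-1], hist):
    print(f"log-odds {lo:+.1f}: {'#' * int(200 * density)}")
```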

Razied · 1d

Most of the weird stuff involving priors comes into being when you want posteriors over a continuous hypothesis space, where you get in trouble because reparametrizing your space changes the form of your prior, so a uniform "natural" prior is really a particular choice of parametrization. Using a discrete hypothesis space avoids big parts of the problem.
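
Concretely, under a reparametrization $y = f(x)$ with $f$ a smooth bijection, the density picks up a Jacobian factor,

$$ p_Y(y) = p_X\big(f^{-1}(y)\big)\,\left|\frac{d f^{-1}}{dy}(y)\right|, $$

so a prior that is uniform in one parametrization stays uniform in another only when the map between them is affine.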

Razied · 7d

Wait, why doesn't the entropy of your posterior distribution capture this effect? In the basic example where we get to see samples from a Bernoulli process, the posterior is a Beta distribution that gets ever sharper around the truth. If you compute the entropy of the posterior, you might say something like "I'm unlikely to change my mind about this, my posterior only has 0.2 bits to go until zero entropy". That's already a quantity which estimates how much future evidence will influence your beliefs.
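
A rough sketch of that bookkeeping (the flip sequence below is made up, and scipy reports differential entropy in nats rather than bits):

```python
from scipy.stats import beta

# Beta(1, 1) is the flat prior over the coin's heads-probability.
heads, tails = 0, 0
observations = [1, 1, 0, 1, 1, 1, 0, 1] * 20  # made-up Bernoulli data

for i, flip in enumerate(observations, start=1):
    heads += flip
    tails += 1 - flip
    posterior = beta(1 + heads, 1 + tails)
    if i % 40 == 0:
        # The differential entropy keeps dropping as the posterior sharpens
        # around the truth, which is one way to quantify "how much more
        # could future evidence move me?"
        print(f"after {i:3d} flips: entropy = {posterior.entropy():+.3f} nats")
```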

Razied · 7d

Surely something like the expected variance of your future credence $p_t$ would be a much simpler way of formalising this, no? The probability over time is just a stochastic process, and OP is expecting the variance of this process to be very high in the near future.
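
One way to cash that out as a computation (a sketch only; the coin example, the Beta(3, 2) belief state, and the 20-flip horizon are all assumptions for concreteness): sample hypothetical near-future evidence from your current predictive distribution, update on it, and look at the variance of the resulting credences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Current beliefs about a coin: Beta(a, b) over the heads-probability.
a, b = 3, 2
horizon = 20          # how many more flips we expect to see "soon"
n_sims = 10_000

future_credences = np.empty(n_sims)
for i in range(n_sims):
    # Sample a world consistent with the current posterior, then sample
    # the evidence that world would produce over the horizon.
    p = rng.beta(a, b)
    heads = rng.binomial(horizon, p)
    # Credence in "heads-probability > 1/2" after that hypothetical evidence.
    posterior_samples = rng.beta(a + heads, b + horizon - heads, size=200)
    future_credences[i] = np.mean(posterior_samples > 0.5)

current = np.mean(rng.beta(a, b, size=100_000) > 0.5)
print(f"credence now: {current:.3f}")
print(f"expected variance of credence after {horizon} flips: "
      f"{future_credences.var():.3f}")
```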

Razied · 1mo

Unfortunately the entire complexity has just been pushed one level down into the definition of "simple". The L2 norm can't really be what we mean by simple: scaling the weights in one layer by A and the weights in the next layer by 1/A leaves the output of the network invariant (assuming ReLU activations, which are positively homogeneous), yet you can make the L2 norm arbitrarily large just by choosing A large enough.
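
A quick numerical check of that invariance (a sketch with arbitrary layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A tiny two-layer ReLU network: f(x) = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=8)

for A in [1.0, 10.0, 1000.0]:
    # Scale one layer up by A and the next down by 1/A. Because ReLU is
    # positively homogeneous, relu(A * z) = A * relu(z) for A > 0, so the
    # network's output is unchanged...
    out = (W2 / A) @ relu((A * W1) @ x)
    # ...while the squared L2 norm of the weights blows up like A**2.
    l2 = np.sum((A * W1) ** 2) + np.sum((W2 / A) ** 2)
    print(f"A={A:7.1f}  output[0]={out[0]:+.6f}  squared L2 norm={l2:,.1f}")
```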

Razied · 2mo

Unfortunately, if OpenAI the company is destroyed, all that happens is that its employees get hired by Microsoft, the lettering on the office building changes, and sama's title changes from CEO to whatever high-level manager position he'd occupy within Microsoft.

Razied · 2mo

Hmm, but here the set of possible world states would be the domain of the function we're optimising, not the function itself. The No-Free-Lunch theorem states (quoting Wikipedia):

Theorem 1: Given a finite set $V$ and a finite set $S$ of real numbers, assume that $f : V \to S$ is chosen at random according to the uniform distribution on the set $S^V$ of all possible functions from $V$ to $S$. For the problem of optimizing $f$ over the set $V$, then no algorithm performs better than blind search.

Here $V$ is the set of possible world arrangements, which is admittedly much smaller than all possible data structures, but the theorem still holds because we're averaging over all possible value functions on this set of worlds, a set which is not physically restricted by anything.
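
A brute-force check of the theorem on a toy instance (the domain and codomain sizes are arbitrary, and the sketch restricts to fixed, non-adaptive search orders):

```python
import itertools
import numpy as np

V = range(4)          # the search domain (toy stand-in for world arrangements)
S = [0, 1, 2]         # the possible objective values

# Enumerate every possible objective function f: V -> S.
all_functions = list(itertools.product(S, repeat=len(V)))

# Each deterministic, non-adaptive search strategy is just an order in
# which to evaluate the points of V.
for order in itertools.permutations(V):
    # Average best value found after k evaluations, over all functions.
    best_after_k = np.zeros(len(V))
    for f in all_functions:
        running_best = -np.inf
        for k, x in enumerate(order):
            running_best = max(running_best, f[x])
            best_after_k[k] += running_best
    best_after_k /= len(all_functions)
    print(order, best_after_k)   # identical row for every ordering
```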

I'd be very interested if you can find Byrnes' writeup.

Answer by Razied · Feb 14, 2024

Obviously LLMs memorize some things. The easy example is that the pretraining dataset of GPT-4 probably contained lots of cryptographically hashed strings which are impossible to infer from the overall patterns of language; predicting those accurately absolutely requires memorization, and there's literally no other way unless the LLM solves an NP-hard problem. Then there are in-between things like Barack Obama's age, which might be possible to infer from other language (a president is probably not 10 years old, or 230), but within the plausible range you still just need to memorize it.

Razied · 3mo

There is no optimization pressure from “evolution” at all. Evolution isn’t tending toward anything. Thinking otherwise is an illusion.

Can you think of any physical process at all where you'd say that there is in fact optimization pressure? Of course at the base layer it's all just quantum fields changing under unitary evolution with a given Hamiltonian, but you can still identify subparts of the system that are isomorphic with a process we'd call "optimization". Evolution doesn't have a single time-independent objective it's optimizing, but it does seem to me that it's basically doing optimization on a slowly time-changing objective.

Razied · 3mo

Why would you want to take such a child and force them to ‘emotionally develop’ with dumber children their own age?

Because you primarily make friends in school with people in your grade, and if a gifted kid skips too many grades, the physical difference between them and their classmates will prevent them from building a social circle based on physical play, and will probably make any sort of dating much harder.
