Loppukilpailija

Feature suggestion: Allow one to sort a user's comments by the number of votes.

Context: I saw a comment by Paul Christiano, and realized that probably a significant portion of the views expressed by a person lie in comments, not top-level posts. However, many people (such as Christiano) have written a lot of comments, so sorting them would allow one to find more valuable comments more easily.

Note that you can still retract your signature for 30 days after signing. See here:

https://help.change.org/s/article/Remove-signature-from-petition?language=en_US

Ah, I misunderstood the content of the original tweet - I didn't register that the model indeed had access to lots of data in other languages as well. In retrospect I should have been way more shocked if this wasn't the case. Thanks.

I then agree that it's not too surprising that the instruction-following behavior is not dependent on language, though it's certainly interesting. (I agree with Habryka's response below.)

I feel like this answer glosses over the fact that the encoding changes. Surely you can find some encodings of instructions such that LLMs cannot follow instructions in that encoding. So the question lies in why learning the English encoding also allows the model to learn (say) German encodings.

The fair-goers, having knowledge of oxen, had no bias in their guesses

[EDIT: I read this as "having no knowledge of oxen" instead of "having knowledge of oxen" - is this what you meant? The comment seems relevant nevertheless.]

This does not follow: It is entirely possible that the fair-goers had no specific domain knowledge of oxen, while still having biases arising from domain-general reasoning. And indeed, they probably knew *something* about oxen -- from Jaynes' Probability Theory:

The absurdity of the conclusion [that polling a billion people tells the height of China's emperor with accuracy 0.03 mm] tells us rather forcefully that the √N rule is not always valid, even when the separate data values are causally independent; it is essential that they be logically independent. In this case, we know that the vast majority of the inhabitants of China have never seen the Emperor; yet they have been discussing the Emperor among themselves, and some kind of mental image of him has evolved as folklore. Then, knowledge of the answer given by one does tell us something about the answer likely to be given by another, so they are not logically independent. Indeed, folklore has almost surely generated a systematic error, which survives the averaging; thus the above estimate would tell us something about the folklore, but almost nothing about the Emperor.
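Jaynes' point can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the true height, the shared "folklore" bias and the noise level are hypothetical numbers chosen for illustration, and every guess is modeled as truth plus a common bias plus independent noise.

```python
import random
import statistics

def crowd_estimate(true_value, shared_bias, noise_sd, n, seed=0):
    """Average of n guesses that all share one systematic bias.

    Each guess is true_value + shared_bias + independent Gaussian noise.
    All parameters are hypothetical and only serve the illustration.
    """
    rng = random.Random(seed)
    guesses = [true_value + shared_bias + rng.gauss(0.0, noise_sd)
               for _ in range(n)]
    return statistics.fmean(guesses)

TRUE_HEIGHT = 170.0   # hypothetical true value, in cm
FOLKLORE_BIAS = 10.0  # hypothetical systematic error shared by everyone
NOISE_SD = 20.0       # hypothetical spread of individual guesses

small_crowd = crowd_estimate(TRUE_HEIGHT, FOLKLORE_BIAS, NOISE_SD, 100)
large_crowd = crowd_estimate(TRUE_HEIGHT, FOLKLORE_BIAS, NOISE_SD, 100_000)
print(small_crowd, large_crowd)
```

Averaging more guesses shrinks the independent noise, so the large-crowd average settles near the truth plus the bias rather than near the truth itself.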

Minor suggestion: I would remove the caps from the title. Reason: I saw this linked below Christiano's post, and my snap reaction was that the post is [angry knee-jerk response to someone you disagree with] rather than [thoughtful discussion and disagreement]. Only after introspection did I read this post.

I found janus's post Simulators to address this question very well. Much of AGI discussion revolves around agentic AIs (see the section Agentic GPT for discussion of this), but this does not model large language models very well. janus suggests that one should instead think of LLMs such as GPT-3 as "simulators". Simulators are not very agentic themselves or well described as having a utility function, though they may create simulacra that are agentic (e.g. GPT-3 writes a story where the main character is agentic).

A couple of examples from quadratic residue patterns modulo primes:

**Example 1.** Let $p$ be a large prime. How many elements $x$ are there such that both $x$ and $x+1$ are quadratic residues?

Since half of the elements mod $p$ are quadratic residues and the events "$x$ is a QR" and "$x+1$ is a QR" look like they are independent, a reasonable guess is $p/4$. This is the correct main term, but what about the error? A natural square-root error term is *not* right: one can show that the error is $O(1)$, the error depending only on whether $p$ is $1$ or $3$ mod $4$. (The proof is by elementary manipulations with the Legendre symbol, see here.) So there's hidden structure that makes the error smaller than what a naive randomness heuristic suggests.
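The bounded error in Example 1 is easy to check numerically. Here is a minimal sketch, assuming the convention that a quadratic residue means a nonzero square mod $p$ (tested via Euler's criterion); the function names and the choice of test primes are mine.

```python
def is_qr(a, p):
    """True if a is a nonzero square modulo the odd prime p (Euler's criterion)."""
    return a % p != 0 and pow(a, (p - 1) // 2, p) == 1

def count_qr_pairs(p):
    """Number of x in 1..p-2 such that x and x+1 are both quadratic residues."""
    return sum(1 for x in range(1, p - 1) if is_qr(x, p) and is_qr(x + 1, p))

# Print p, p mod 4, the count, and 4*count - p (the error scaled by 4).
for p in (101, 103, 1009, 1019):
    n = count_qr_pairs(p)
    print(p, p % 4, n, 4 * n - p)
```

If the claim holds, the last column should take only two values, one for each residue class of $p$ mod $4$, no matter how large $p$ gets.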

**Example 2.** Let $p$ be a large prime. How many elements $x$ are such that all of $x$, $x+1$ and $x+2$ are quadratic residues?

Again, the obvious guess for the main term is correct (there are roughly $p/8$ such $x$), so consider the error term. The error is not $O(1)$ this time! The next guess is a square-root error term. Perhaps as $p$ ranges over the primes, the error term is (after suitable normalization) normally distributed (as is motivated by the central limit theorems)? This is not correct either!

The error is in fact bounded in absolute value by $C\sqrt{p}$ for some constant $C$, following from a bound on the number of points on elliptic curves modulo $p$. And for the distribution of the error, if one normalizes the error by dividing by $C\sqrt{p}$ (so that the resulting value is in $[-1, 1]$), the distribution behaves like $\cos(\theta)$, where $\theta$ is uniformly distributed on $[0, \pi]$ as $p$ ranges over the primes (see here). This is a deep result, which is not easy to motivate in a couple of sentences, but again there's hidden structure that the naive randomness heuristic does not account for.

(And no, one does not get a normal distribution for longer streaks of quadratic residues either.)
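For anyone who wants to look at the error in Example 2 directly, here is a rough sketch that counts triples of consecutive quadratic residues and divides the deviation from $p/8$ by $\sqrt{p}$. The helper names and the range of primes are my own choices, and as before a quadratic residue is taken to mean a nonzero square.

```python
import math

def primes_up_to(n):
    """Primes <= n via a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    return [i for i, is_p in enumerate(sieve) if is_p]

def is_qr(a, p):
    """True if a is a nonzero square modulo the odd prime p (Euler's criterion)."""
    return a % p != 0 and pow(a, (p - 1) // 2, p) == 1

def count_qr_triples(p):
    """Number of x in 1..p-3 with x, x+1 and x+2 all quadratic residues."""
    return sum(1 for x in range(1, p - 2)
               if is_qr(x, p) and is_qr(x + 1, p) and is_qr(x + 2, p))

def normalized_error(p):
    """(count - p/8) / sqrt(p): the deviation from the naive guess, rescaled."""
    return (count_qr_triples(p) - p / 8) / math.sqrt(p)

for p in primes_up_to(200):
    if p > 3:
        print(p, round(normalized_error(p), 3))
```

The printed values should stay in a bounded range as $p$ grows, and a histogram over many primes is a quick way to see that they do not look normally distributed.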

Truth-tracking - having an impact is hard! It’s really important to have true beliefs, and the best way to find them is by trying hard to form your own views and ensuring they correlate with truth. It’s easy to get deferring wrong if you trust the wrong people.

There's another interpretation of "truth-tracking" where forming an inside view is important: It's easier to notice when you are wrong. In other words, even if you defer to the right person, it might be hard to notice when they are wrong (unless you have a very deep understanding of their views).

This seems like a more important reason than the "deferring to the wrong people" issue: new progress, both in AI and on the theoretical side, calls for continuously updating one's models, so you want to reduce friction on that.

Inspired by the "reward chisels cognition into the agent's network" framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by a suitable choice of the reward function?

I looked at two toy cases, namely Q-learning and training a neural network (the latter of which is not actually reinforcement learning but supervised learning). The answers were "yep, suitable reward/loss (and datapoints in the case of supervised learning) are enough".

I was hoping for this not to be the case, as that would have been more interesting (imagine if there were fundamental limitations to the reward/loss paradigm!), but anyways. I now expect that also in more complicated situations reward/loss are, in principle, enough.

**Example 1: Q-learning.** You have a set $S$ of states and a set $A$ of actions. Given a target policy $\pi: S \to A$, can you necessarily choose a reward function $R: S \times A \to \mathbb{R}$ such that, training for long enough with Q-learning (with positive learning rate and discount factor, and assuming we visit all of the states in $S$ many times), the action with the highest learned Q-value is the one given by the target policy: $\forall s \in S: \operatorname{argmax}_{a \in A} Q(s, a) = \pi(s)$?

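The answer is yes. Simply reward the behavior you want to see: let $R(s, a) = 1$ if $a = \pi(s)$ and $R(s, a) = 0$ otherwise. (With a discount factor $\gamma < 1$, always following $\pi$ is worth $1/(1-\gamma)$ from every state, so in the limit $Q(s, \pi(s)) = 1/(1-\gamma)$ while every other action has $Q(s, a) = \gamma/(1-\gamma)$.)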
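Here is a minimal tabular sketch of this, under assumptions of my own: a small random environment with deterministic transitions, a randomly chosen target policy, and state-action pairs sampled uniformly at random as a stand-in for "visiting every state many times". None of these choices come from a particular library or paper.

```python
import random

def q_learning_with_indicator_reward(n_states=5, n_actions=3, alpha=0.1,
                                     gamma=0.9, steps=100_000, seed=0):
    """Tabular Q-learning where the reward is 1 exactly when a == target[s]."""
    rng = random.Random(seed)
    # Hypothetical environment: each (state, action) leads to a fixed random state.
    next_state = [[rng.randrange(n_states) for _ in range(n_actions)]
                  for _ in range(n_states)]
    # Hypothetical target policy we want the agent to end up with.
    target = [rng.randrange(n_actions) for _ in range(n_states)]
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        # Uniform sampling guarantees every state-action pair is updated often.
        s = rng.randrange(n_states)
        a = rng.randrange(n_actions)
        r = 1.0 if a == target[s] else 0.0
        s2 = next_state[s][a]
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
    greedy = [max(range(n_actions), key=lambda a: q[s][a])
              for s in range(n_states)]
    return target, greedy, q

target, greedy, q = q_learning_with_indicator_reward()
print("target policy:", target)
print("greedy policy:", greedy)
```

With these settings the greedy policy read off the learned table should coincide with the target policy, whatever the transition structure happens to be.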

(In fact, one can more strongly choose, for any target value function $Q': S \times A \to \mathbb{R}$, a reward function $R$ such that the values $Q(s, a)$ in Q-learning converge in the limit to $Q'(s, a)$. So not only can you force certain *behavior* out of the model, you can also choose the *internals*.)

**Example 2: Neural network.** Say you have a neural network $\mathbb{R}^n \to \mathbb{R}$ with $m$ tunable weights $w = (w_1, \ldots, w_m)$. Can you, by suitable input-output pairs and choices of the learning rate, modify the weights of the net so that they are (approximately) equal to $w' = (w'_1, \ldots, w'_m)$?

(I'm assuming here that we simply update the weights after each data point, instead of doing SGD or something. The choice of loss function is not very relevant, take e.g. squared error.)

The following sketch convinces me that the answer is positive:

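Choose $m$ random input-output pairs $(x_i, y_i)$. The gradients $g_i$ of the loss with respect to the weight vector are almost certainly linearly independent. Hence, some linear combination $c_1 g_1 + \ldots + c_m g_m$ of them equals $w - w'$. Now, for small $\epsilon > 0$, running back-propagation on the pair $(x_i, y_i)$ with learning rate $\epsilon c_i$ for all $i = 1, \ldots, m$ gives you an update approximately in the direction of $w' - w$, since gradient descent subtracts each $\epsilon c_i g_i$ (note that some of the $c_i$ may be negative). Rinse and repeat.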
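The sketch can be tried out on a toy network. Everything below is an assumption for illustration: the network $f(x) = w_2 \tanh(w_0 x + w_1)$ with $m = 3$ weights, the start and target weights, and the sampling ranges. One simplification relative to the text: within a round, all $m$ gradients are computed at the same weights before the steps are applied, so the combined step is exactly $\epsilon (w' - w)$ rather than only approximately. The coefficients are solved so that $\sum_i c_i g_i = w - w'$, which makes gradient-descent steps with learning rates $\epsilon c_i$ move toward $w'$.

```python
import math
import random

def grad(w, x, y):
    """Gradient of 0.5 * (f(x) - y)**2 for f(x) = w[2] * tanh(w[0]*x + w[1])."""
    t = math.tanh(w[0] * x + w[1])
    r = w[2] * t - y            # residual f(x) - y
    s = 1.0 - t * t             # derivative of tanh
    return [r * w[2] * s * x, r * w[2] * s, r * t]

def solve(a, b):
    """Solve a @ c = b by Gaussian elimination; None if nearly singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-6:
            return None
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (m[i][n] - sum(m[i][k] * c[k] for k in range(i + 1, n))) / m[i][i]
    return c

def steer(w, target, eps=0.1, rounds=200, seed=0):
    """Move w toward target using only per-example gradient steps."""
    rng = random.Random(seed)
    w = list(w)
    n = len(w)
    for _ in range(rounds):
        c = None
        while c is None:  # resample if the gradients are nearly dependent
            pairs = [(rng.uniform(-2, 2), rng.uniform(3, 5)) for _ in range(n)]
            grads = [grad(w, x, y) for x, y in pairs]
            # Column i of the matrix is g_i; solve sum_i c_i g_i = w - target.
            a = [[grads[i][k] for i in range(n)] for k in range(n)]
            c = solve(a, [w[k] - target[k] for k in range(n)])
        for ci, g in zip(c, grads):
            lr = eps * ci  # per-example learning rate, possibly negative
            w = [wk - lr * gk for wk, gk in zip(w, g)]
    return w

w_target = [2.0, -0.5, 1.5]                  # hypothetical target weights
w_final = steer([1.0, 0.5, 1.0], w_target)   # hypothetical starting weights
print(w_final)
```

This toy network has degenerate points (for example $w_0 = 0$ or $w_2 = 0$) where the gradients cannot span all directions, so the start and target were picked to keep the straight-line path away from them.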