In a transformer, the compute cost of attention for context length n grows as O(n^2), so it's a 16x increase in compute cost to go from 2000 tokens to 8000, and another 16x increase to go to 32000. To the best of my knowledge, there isn't much additional cost to a longer context window beyond that - the number of parameters needed to encode more positions is very small for a model this big.
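To make the arithmetic concrete, here's a rough back-of-the-envelope sketch (the hidden size and total parameter count below are purely illustrative assumptions, and a model using rotary or relative position encodings would add essentially no positional parameters at all):

```python
# Rough sketch: attention compute scales ~n^2 with context length,
# while learned positional-embedding parameters scale only ~n * d_model.
# d_model and the total parameter count are illustrative assumptions.

d_model = 12288            # assumed hidden size for a very large model
total_params = 175e9       # assumed total parameter count

for n_ctx in [2_000, 8_000, 32_000]:
    rel_attention_cost = (n_ctx / 2_000) ** 2   # O(n^2) attention term, vs. a 2k-token baseline
    pos_embed_params = n_ctx * d_model           # extra params for learned absolute positions
    print(f"n={n_ctx:>6}: attention ~{rel_attention_cost:.0f}x the 2k cost, "
          f"positional params ~{pos_embed_params / 1e6:.0f}M "
          f"({pos_embed_params / total_params:.3%} of the model)")
```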
I do not understand this paragraph; it seems like the first sentence contradicts the second.
Edit: I think I understand. Are you saying there isn't much additional cost on top of the cost mentioned in the previous sentence because the position encoding is tiny compared to everything else in the model?
Oh okay, I misunderstood. I forgot about that whole DNC scandal.
I agree that a public investigation would probably hurt the rationalist community's reputation.
However, reputation is only one consideration, and the key disanalogy is still the level of evidence. Also, a discreet investigation may be possible.
I don't think there are grounds for a high-profile external investigation into the rationalist community.
But yes, we should try to be better than the rest of society in every way. I think the risk of sexual abuse is high enough that this would be a profitable use of resources, whereas my prior is that the risk of child abuse (at least child sex trafficking) does not merit spending effort to investigate.
I don't know anything about the DNC, so I can't say what would be worth their effort to do.
I think you are suggesting that I am committing the fallacy of privileging the hypothesis, but I think the stories in the article and the associated comment sections are sufficient to bring this to our attention.
Several things can be true simultaneously:
To be clear, I'm not at all confident that all of the empirical claims above are true. But it seems that people are using the earlier points as an excuse to ignore the later ones.
It would be great if you could share some code from your experiments. Did you use PyTorch? It seems non-trivial to load the model as it wasn't implemented in PyTorch.
For each token prediction, we record the activation of the neuron and whether or not " an" has a greater logit than any other token (i.e., whether it was the top prediction).
We group the activations into fixed-width buckets. For each bucket, we plot the fraction of predictions for which " an" was the top token.
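In case pseudocode helps, here's a simplified sketch of that bucketing/plotting step (this is not the actual experiment code; the placeholder data, variable names, and bucket width are just illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simplified sketch, assuming we already recorded, for each token prediction:
#   activations - the neuron's activation at that position
#   an_was_top  - whether " an" had the highest logit (i.e. was the top prediction)
# Placeholder data and bucket width are illustrative, not the real values.
activations = np.random.randn(10_000)
an_was_top = np.random.rand(10_000) < 0.01

bucket_width = 0.5  # assumed
edges = np.arange(activations.min(), activations.max() + bucket_width, bucket_width)

counts, _ = np.histogram(activations, bins=edges)
top_counts, _ = np.histogram(activations[an_was_top], bins=edges)
fraction_an_top = top_counts / np.maximum(counts, 1)  # avoid division by zero
centers = (edges[:-1] + edges[1:]) / 2

plt.plot(centers, fraction_an_top, marker="o")
plt.xlabel("neuron activation (bucket center)")
plt.ylabel('fraction of predictions where " an" was the top token')
plt.show()
```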
Does that clarify things for you?
I don't think there was much reason for choosing " a" vs. " an" to study over something else. This was the first thing we investigated, and we were excited to see a single-neuron mechanism, so we kept going. Bear in mind this project originated in a 48-hour hackathon :)
Any update on when this might happen?