Joseph Miller

In a transformer, the compute cost for context length n grows at O(n^2)[4], so it's a 16x increase in compute cost to go from 2000 tokens to 8000, and another 16x increase to go to 32000. To the best of my knowledge, there isn't much additional cost to a longer context window - the number of parameters to encode more positions is very small for a model this big.

I do not understand this paragraph, it seems like the first sentence contradicts the second.

Edit: I think I understand. Are you saying there isn't much additional cost on top of the cost mentioned in the previous sentence because the position encoding is tiny compared to everything else in the model?
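For what it's worth, the quadratic scaling in the quoted paragraph can be sanity-checked with a one-liner (my own illustration, not from the original comment):

```python
# Attention compute scales as O(n^2) in context length n, so each 4x
# increase in tokens costs roughly 16x more compute.
def relative_attention_cost(n_tokens: int, base_tokens: int = 2000) -> float:
    """Return attention compute cost relative to a base context length."""
    return (n_tokens / base_tokens) ** 2

print(relative_attention_cost(8000))   # 16.0 (2000 -> 8000 tokens)
print(relative_attention_cost(32000))  # 256.0 (another 16x on top of that)
```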

I've also noticed this. I think the biggest factor is that search makes it less useful, because it bases its answers too heavily on the search results. Bad fine-tuning is probably another part of it. I usually prompt it with "Don't perform any searches" and get better results.

I'd recommend this extension that allows you to conveniently watch videos faster than 2x.

Oh okay, I misunderstood. I forgot about that whole DNC scandal.

I agree that a public investigation would probably hurt the rationalist community's reputation.

However, reputation is only one consideration, and the key disanalogy is still the level of evidence. Also, a discreet investigation may be possible.

I don't think there are grounds for a high-profile external investigation into the rationalist community.

But yes, we should try to be better than the rest of society in every way. I think the risk of sexual abuse is high enough that this would be a profitable use of resources whereas my prior is that the risk of child abuse (at least child sex trafficking) does not merit spending effort to investigate.

I don't know anything about the DNC, so I can't say what would be worth their effort to investigate.

I think you are suggesting that I am committing the fallacy of privileging the hypothesis, but I think the stories in the article and associated comment sections are sufficient to raise this to our attention.

Several things can be true simultaneously:

  • This article is similar to much other mainstream coverage of EA/rationality and paints the community in an unfairly negative light.
  • The specific claims in the article have been previously addressed.
  • There is no good evidence that the LW / rationalist community has higher than average levels of abuse.
  • It is worthwhile putting effort into finding out whether the community has higher than average levels of abuse, which does not seem to have been done by people in the community. Given the gender imbalance, our prior should be that higher than average levels of abuse are somewhat likely.
  • We can and should have much lower than average levels of abuse.
  • This community strives to exceed the rest of society in many domains. It is anomalous that people are quite uninterested in optimizing this as it seems clearly important.

To be clear, I'm not at all confident that all of the empirical claims above are true. But it seems that people are using the earlier points as an excuse to ignore the later ones.

It would be great if you could share some code from your experiments. Did you use PyTorch? It seems non-trivial to load the model as it wasn't implemented in PyTorch.


For each token prediction we record the activation of the neuron and whether or not " an" has a greater logit than any other token (i.e. whether it was the top prediction).

We group the activations into buckets of width . For each bucket we plot

Does that clarify things for you?
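Here's a rough sketch of the bucketing procedure, if that helps (illustrative only — the bucket width of 0.5 and the sample data are made up for the example):

```python
from collections import defaultdict

def bucket_fraction(records, bucket_width=0.5):
    """records: list of (activation, an_was_top) pairs, one per token prediction.
    Returns {bucket_start: fraction of predictions where " an" was top}."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [an_top_count, total]
    for activation, an_was_top in records:
        # Assign each activation to a fixed-width bucket.
        bucket = (activation // bucket_width) * bucket_width
        counts[bucket][0] += int(an_was_top)
        counts[bucket][1] += 1
    # Fraction of " an"-top predictions per bucket (this is what gets plotted).
    return {b: top / total for b, (top, total) in sorted(counts.items())}

records = [(0.2, False), (0.4, False), (1.1, True), (1.3, True), (1.4, False)]
print(bucket_fraction(records))
```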

I don't think there was much reason for choosing " a" vs. " an" to study over something else. This was the first thing we investigated and we were excited to see a single neuron mechanism, so we kept going. Bear in mind this project originated in a 48 hour hackathon :)

This seems all correct to me except possibly this:

So, artificially increasing W_in’s neurons to eg 100 should cause the same token to be predicted regardless of the prompt

W_in is the matrix of input weights for each neuron. So you could increase the activation of the " an" neuron by multiplying the input weights of that neuron by 100 (i.e. W_in → 100 · W_in for that neuron only).

And if you increase the " an" neuron's activation you will increase " an"'s logit. Our data suggests that if the activation is above a certain threshold then " an" will almost always be the top prediction.
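In plain PyTorch the intervention looks something like this (a minimal sketch: the dimensions, neuron index, and scale factor are all placeholders, and the real model's weights would be loaded rather than random):

```python
import torch

torch.manual_seed(0)
d_model, d_mlp = 8, 32
NEURON, SCALE = 5, 100.0            # hypothetical neuron index and scale factor

W_in = torch.randn(d_model, d_mlp)  # stand-in for the MLP's input weight matrix
x = torch.randn(d_model)            # residual-stream input to the MLP layer

pre_acts_before = x @ W_in          # pre-activations for every neuron
W_in[:, NEURON] *= SCALE            # W_in -> 100 * W_in for the one neuron
pre_acts_after = x @ W_in

# Only the targeted neuron's pre-activation changes; it is scaled by SCALE.
print(pre_acts_after[NEURON] / pre_acts_before[NEURON])  # close to 100.0
```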

If the neuron activation is relatively very high, then this swamps the direction of your activations

I think this is true but not necessarily relevant. On the one hand, this neuron's activation will increase the logit of " an" regardless of what the other activations are. On the other hand, if the other activations are high, then this may reduce the probability of " an", either by increasing other logits or by activating other neurons in later layers that write the opposite of the " an" direction to the residual stream.
