## LessWrong

NickyP

Nicky Pochinkov

https://nicky.pro


Thanks for this comment! I think this is one of the main concerns I am pointing at.

I think something like fiscal aid could work, but have people tried modelling responses to things like this? It feels like the relatively decent response to covid came because the government was both enforcing a temporary lockdown policy and sending checks to keep things "back to normal" despite it. If job automation is slightly more gradual, on the scale of months to years, and hits only certain jobs at a time, the response could be quite different, and it might be more likely that things end up poorly.

Yeah, though I think it depends on how many people are able to buy the new goods at a better price. If most well-paid employees (i.e. the employees that companies get the most value from automating) no longer have a job, then the number of people who can buy the more expensive goods and services might go down. It seems plausible to me that GDP could fall if the number of people who lose their jobs is high enough. It feels possible that the recent tech developments were barely net positive to nominal GDP despite rapid improvements, and that fast enough technological progress could cause nominal GDP to go in the other direction.

I suspect that with a tuned initial prompt that ChatGPT would do much better. For example, something like:

Simulate an assistant on the other end of a phone call, who is helping me to cook a turmeric latte in my kitchen. I have never cooked before and need extremely specific instructions. Only speak one sentence at a time. Only explain one instruction at a time. Never say "and". Please ask clarifying questions if necessary. Only speak one sentence at a time, and await a response. Be explicit about:
- where I need to go.
- what I need to get
- where I need to bring things

Do you understand? Say "I Accept" and we can begin.


I have not fully tested this, but I would guess a tuned prompt of this sort would make it possible, though ChatGPT is not tuned to answer this way by default. (ChatGPT can also simulate a virtual Linux shell.)

In addition, I have found it is much better to go back and edit the prompt that preceded an incorrect answer, as the model starts to reference itself a lot. Though I also expect that in this situation, having a reference recipe at the top would be useful.

Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?

Yeah, I would say this is the main idea I was trying to get towards.

If that's the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?

I think in further analysis I will probably just look at the activations instead of the output + residual, since the trends weren't particularly clear in the outputs of the fully-connected layer; or at least find a better metric than cosine similarity. Cosine similarity probably won't be too useful for much deeper analysis, but I think it was somewhat useful for showing some trends.

I have also tried using a "scaled cosine similarity" metric, which shows essentially the same output but preserves relative length (that is, instead of normalising each vector to 1, I rescaled each vector by the length of the largest vector, so that the largest vector has length 1 and every other vector is equal or smaller in size).

With this metric, I think the graphs were slightly better, but the similarity plots between different vectors had the behaviour that all vectors looked more similar to the longest vector, which I thought made it more difficult to see the similarity of small vectors on the graphs, and it felt like adding a weird new metric would be more confusing. (Though writing this now, it seems an obvious mistake: I should have just written the post using "scaled cosine similarity", or some better metric if I could find one, since it seems important here that two basically-zero vectors should have a very high similarity, and this isn't captured by either metric.) I might edit the post to add some extra graphs in an appendix, though this might also go into a separate post.

As for looking at the attention heads instead of the attention blocks: so far I haven't seen that they are a particularly better unit for distinguishing between the different categories of text (though for this analysis I have only looked at OPT-125M so far). When looking at the outputs of the attention heads and their cosine similarities, the main difference usually seemed to come from a specific dimension being particularly bright, rather than from attention heads lighting up for specific categories. The magnitude of the activations also seemed pretty consistent between attention heads in the same layer (and was very small for most of the middle layers), except for the occasional high-magnitude dimension in the layers near the beginning and end.
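For concreteness, the per-head slicing works roughly as follows (a self-contained sketch with random data standing in for real activations, which in practice would be captured with a forward hook; OPT-125M has 12 heads of dimension 64 per attention block, concatenated into a 768-dimensional output before the output projection):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_heads, head_dim = 100, 12, 64

# Placeholder: one concatenated attention-block output vector per prompt.
acts = rng.normal(size=(n_prompts, n_heads * head_dim))

# Split the concatenated output back into per-head vectors.
per_head = acts.reshape(n_prompts, n_heads, head_dim)

# Pairwise cosine similarity of one head's (64-dim) outputs across all prompts.
head = 5
H = per_head[:, head, :]
H = H / np.linalg.norm(H, axis=1, keepdims=True)
sim = H @ H.T  # shape (n_prompts, n_prompts)
```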

I made some graphs that sort of show this. The indices 0-99 are the same as in the post.

Here are some results for attention head 5 from the attention block in the final decoder layer of OPT-125M:

The left image is the "scaled cosine similarity" between the (small, 64-dimensional) vectors put out by the attention head. The right image shows the raw/unscaled values of the same output vectors, where each column represents one output vector.

Here are the same two plots, but instead for attention head 11 in the attention block of the final layer for OPT-125M:

I still think there might be some interesting things in the individual attention heads (most likely in the key-query behaviour, from what I have seen so far), but I will need to spend some more time doing analysis.

But if your hypothesis is specifically that there are different modules in the network for dealing with different kinds of prompt topics, that seems directly testable just by checking if some sections of the network "light up" or go dark in response to different prompts. Like a human brain in an MRI reacting to visual vs. auditory data.

This is the analogy I have had in my head when trying to do this, but I think my methodology has not tracked it as well as I would have liked. In particular, I still struggle to understand how residual streams can form notions of modularity in networks.

Maybe you have seen it before, but Veloren looks like a project whose people you should talk with. They are building an open-source voxel MMO in Rust, and you might be able to collaborate with them. I think most people working on it are doing it as a side hobby project.