TheManxLoiner


Huge thanks for writing this! Particularly liked the SVD intuition and how it can be used to understand properties of . One small correction I think. You wrote:

is simply the projection along the vector

I think is projection along the vector , so is projection on hyperplane perpendicular to
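The symbols in the comment above did not survive, so the following is only a generic sketch of the distinction being drawn, not a reconstruction of the post's formula. It assumes a hypothetical unit vector `u` and test vector `x` (neither is from the post), and compares applying u uᵀ (projection along `u`) with applying I − u uᵀ (projection onto the hyperplane perpendicular to `u`).

```python
# Hypothetical unit vector u and test vector x in R^3 (illustrative only,
# not taken from the original post).
u = [1.0, 0.0, 0.0]
x = [3.0, 4.0, 5.0]


def dot(a, b):
    """Standard dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def project_along(x, u):
    """Apply u u^T to x: the component of x along the unit vector u."""
    c = dot(u, x)
    return [c * ui for ui in u]


def project_onto_hyperplane(x, u):
    """Apply (I - u u^T) to x: remove the component of x along u."""
    along = project_along(x, u)
    return [xi - ai for xi, ai in zip(x, along)]


along = project_along(x, u)
perp = project_onto_hyperplane(x, u)

print("along u:", along)
print("perpendicular to u:", perp)
print("perp . u =", dot(perp, u))
```

The two pieces add back up to `x`, and the second one has zero dot product with `u`, which is what "projection on the hyperplane perpendicular to" the vector means.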


Interesting ideas, and nicely explained! Some questions:

1) First, notation: request patching means replacing the vector at activation A for R2 on C2 with the vector at the same activation A for R1 on C1. Then the question: did you do any analysis on the set of vectors at A as you vary R and C? Based on your results, I expect that the vector at A is similar if you keep R the same and vary C.

2) I found the success on the toy prompt injection surprising! My intuition up to that point was that R and C are independently represented to a large extent, and you could go from computing R2(C2) to R1(C2) by patching R1 from computation of R1(C1). But the success on preventing prompt injection means that corrupting C is somehow corrupting R too, meaning that C and R are actually coupled. What is your intuition here?

3) How robust do you think the results are if you make C and R more complex? E.g. C contains multiple characters who come from various countries but live in the same city, and R is 'Where does character Alice come from?'


No need to apologise! I missed your response by even more time...

My instinct is that it is because of the relative size of the numbers, not the absolute size.

It might be an interesting experiment to see how the intuition varies based on the ratio of the total amount to the difference in amounts: "You have two items whose total cost is £1100 and the difference in price is £X. What is the price of the more expensive item?", where X can be 10p or £1 or £10 or £100 or £500 or £1000.

With X=10p, one possible instinct is 'that means they are basically the same price, so the more expensive item is £550 + 10p = £550.10'.
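To make the proposed experiment concrete, here is a minimal sketch that tabulates the correct answer next to the 'half the total plus the whole difference' instinct for each suggested X. The function names are hypothetical, and the instinct modelled is only the one described for X=10p; people may well use a different shortcut when X is large.

```python
from fractions import Fraction

TOTAL = Fraction(1100)  # total cost in pounds, as in the question above

# The differences X suggested above, in pounds.
DIFFS = [Fraction(1, 10), Fraction(1), Fraction(10),
         Fraction(100), Fraction(500), Fraction(1000)]


def correct_price(total, diff):
    """Solve cheap + dear = total and dear - cheap = diff for dear."""
    return (total + diff) / 2


def instinctive_price(total, diff):
    """Assumed instinct: split the total evenly, then add the whole difference."""
    return total / 2 + diff


for diff in DIFFS:
    right = correct_price(TOTAL, diff)
    guess = instinctive_price(TOTAL, diff)
    rel_error = (guess - right) / right
    print(f"X = GBP {float(diff):7.2f}: "
          f"correct = GBP {float(right):7.2f}, "
          f"instinct = GBP {float(guess):7.2f}, "
          f"relative error = {float(rel_error):.3%}")
```

Under this assumed instinct the absolute error is always X/2, so the relative error grows with the ratio of the difference to the total, which is the quantity the experiment would vary.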


I have the same experience as you, drossbucket: my rapid answer to (1) was the common incorrect answer, but for (2) and (3) my intuition is well-honed.

A possible reason for this is that the intuitive but incorrect answer in (1) is a decent approximation to the correct answer, whereas the common incorrect answers in (2) and (3) are wildly off the correct answer. For (1) I have to explicitly do a calculation to verify the incorrectness of the rapid answer, whereas in (2) and (3) my understanding of the situation immediately rules out the incorrect answers.

Here are questions which might be similar to (1):

(4a) I booked seats J23 to J29 in a cinema. How many seats have I booked?

(4b) There is a 20m fence in which the fence posts are 2m apart. How many fence posts are there?

(4c) How many numbers are there in this list: 200,201,202,203,204,...,300.

(5) In 24 hours, how many times do the hour-hand and minute-hand of a standard clock overlap?

(6) You are in a race and you just overtake second place. What is your new position in the race?
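For anyone who wants to check their rapid answers to (4a) to (6), here is a small sketch that computes each one by brute counting rather than by formula. It rests on some assumptions not stated in the questions: the fence is straight with a post at each end, the clock is counted over a half-open 24-hour window starting with the hands together at 12:00, and in (6) you were directly behind the runner in second place. The function and runner names are hypothetical.

```python
from fractions import Fraction


def count_inclusive(first, last):
    """Count the integers first, first+1, ..., last by listing them."""
    return len(range(first, last + 1))


def fence_posts(length_m, spacing_m):
    """Posts on a straight fence with a post at each end (an assumption)."""
    positions = range(0, length_m + 1, spacing_m)
    return len(positions)


def hand_overlaps(hours):
    """Count overlaps of the hour and minute hands in the window [0, hours).

    Starting together at time 0, the hands meet every 12/11 hours, so the
    overlap times are 12k/11 for k = 0, 1, 2, ...
    """
    count = 0
    k = 0
    while Fraction(12 * k, 11) < hours:
        count += 1
        k += 1
    return count


def position_after_overtake(order, runner):
    """Swap runner with the person directly ahead; return the new 1-based place."""
    order = list(order)
    i = order.index(runner)
    order[i - 1], order[i] = order[i], order[i - 1]
    return order.index(runner) + 1


print("(4a) seats booked:", count_inclusive(23, 29))
print("(4b) fence posts:", fence_posts(20, 2))
print("(4c) numbers in list:", count_inclusive(200, 300))
print("(5) overlaps in 24 hours:", hand_overlaps(24))
print("(6) new position:", position_after_overtake(["A", "B", "you"], "you"))
```

The first three are the same off-by-one pattern: the count is the difference plus one, which is exactly the step the rapid answer tends to skip.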

Totally agree! This is my big weakness right now - hopefully as I read more papers I'll start developing a taste and ability to critique.