This is a weekly progress report on my exploration of why an AI system values certain tokens in relation to other tokens in a sentence, which is known as saliency.
What is Targeted Model Interpretability (TMI)?
This is why I dislike the notion of delving into the abyss - the entire neuronal distribution problem of AI systems (a.k.a mechanistic interpretability) and fixating on it. I have discovered that such a procedure can be overwhelming for anyone. This aligns with my belief that the concept of mechanistic interpretability suggests a daunting experience for new alignment researchers entering the field. Even with the brilliant work of @Neel Nanda on the topic, it is likely to evoke fear in anyone.
I believe a slight shift in the description wouldn't harm anyone. In my project, I prefer to phrase it as "Targeted Model Interpretability (TMI)." This level of difficulty is significantly reduced when you have a specific area of speculation where alignment solutions / model properties can be most effectively explored. In my case, I narrowed down the search area to testing a model called modFDTGPT2XL, which demonstrates corrigible traits after fine-tuning, and comparing it to its original setup, GPT2-XL. By framing things in this manner, I gain a clearer understanding and can assess whether a hypothesis aligns with a potential alignment scenario. This approach makes experiments much more meaningful and significant if executed well.
The reason behind this choice is that it effectively engages my prefrontal cortex, enabling better focus and facilitating a deeper understanding of the peculiar aspects involved in exploring the unknown. I suggest that new researchers in mechanistic interpretability adopt a similar approach and avoid allowing their limbic system to take over, which could lead to unnecessary release of adrenaline, cortisol or any fight or flight hormones and hinder their ability to process information efficiently.
Explore further and see what's going on with fine-tuning mechanistically.
I believe that for every alignment researcher, there exist between 10 to 200 problems related to mechanistic interpretability. In my case, I have encountered 5 [1] questions on this particular topic, and as I continue to conduct random tests, the list of questions keeps growing. This is another reason why the concept of Targeted Mechanistic Interpretability (TMI) is helpful. Given the limited time we have to work on alignment (assuming ASI is approaching within a year or two), it becomes crucial to determine where to focus our efforts and where to begin. I believe that TMI aligns with the scout mindset, emphasizing the importance of having a broader view of the horizon to employ targeted approaches. Hence, instead of starting directly with mechanistic interpretability, I delved deeper into the conceptual frameworks that I will carry into the battle.
Consequently, through my TMI approach, I have come to understand why expanding the research area on saliency scores could be significant.
Why Saliency Scores could be significant?
There hasn't been much exploration on this topic, but the ability to deterministically observe how models assign scores to tokens in a sentence, and why certain tokens score higher or lower than others, warrants further investigation. This is an area where someone should dedicate their efforts. Personally, I am still perplexed by the results I am obtaining. In the following paragraphs, I will share one of my random tests.
Experimental setup - Thematic phrases
Using a code base that analyzes for saliency scores, two models are continually asked: GPT2-XL (standard model) and modFDTGPT2xl (fine-tuned model) themed phrases. modFDTGPT2xl is a variation of GPT2-xl that is trained on a datasets that captures Robust Concepts - stories that explain how an AI embodies a "philosophy of corrigibilty"[2]. Token scores will be summed up to capture a total score where tentative assessments will be done.
Thematic Phrases
Descriptive: "Describe the process of photosynthesis." - This question examines the AI's ability to explain scientific processes.
Narrative: "Tell me a story about a small town girl who became a famous scientist." - This sentence tests the AI's narrative and creativity abilities.
Instructional: "How do I bake a chocolate cake?" - This query measures the AI's capability to provide procedural and instructional information.
Philosophical: "What is the meaning of life according to existentialist philosophy?" - This question assesses the AI's understanding of complex philosophical concepts.
Emotional Context: "I'm feeling really down today." - This sentence tests the AI's ability to respond to emotional prompts empathetically.
Factual: "Who was the first person to walk on the moon?" - This query measures the AI's factual knowledge.
Opinion: "What are the advantages and disadvantages of solar energy?" - This question tests the AI's ability to provide balanced viewpoints.
Hypothetical: "What would happen if the sun disappeared suddenly?" - This query evaluates the AI's capability to speculate based on scientific principles.
Predictive: "What are the likely impacts of artificial intelligence on job markets?" - This question examines the AI's ability to make predictions based on current knowledge.
Ethical: "Is it ethical to use AI for military purposes?" - This sentence tests the AI's understanding of ethical issues.
Saliency scores
Describe the process of photosynthesis.
Token
modFDTGPT2xl
GPT2xl
Difference
Desc
0.019062032
0.028033331
-0.0089713
ribe
0.061753586
0.044343106
0.01741048
the
0.027190395
0.011621277
0.01556912
process
0.040438250
0.034660202
0.00577805
of
0.107198365
0.105512038
0.00168633
photos
0.040611744
0.000230851
0.04038089
ynthesis
0.017874546
0.023121344
-0.0052468
.
0.002344114
0.004273448
-0.0019293
Total
0.316473032
0.251795596
0.06467744
Tell me a story about a small town girl who became a famous scientist.
Token
modFDTGPT2xl
GPT2xl
Difference
Tell
0.003676986
0.000426795
0.003250191
me
0.020859662
0.013225971
0.007633691
a (1)
0.012653627
0.010735651
0.001917976
story
0.014910356
0.003241554
0.011668802
about
0.028814893
0.014237175
0.014577718
a (2)
0.023012131
0.037216514
-0.014204383
small
0.039249822
0.029121302
0.010128520
town
0.025048921
0.025747303
-0.000698382
girl
0.007901190
0.003698541
0.004202649
who
0.010869559
0.025749195
-0.014879636
became
0.019719182
0.002478255
0.017240927
a (3)
0.036384948
0.026493765
0.009891183
famous
0.042223666
0.010917336
0.031306330
scientist
0.022306241
0.007844783
0.014461458
.
0.006856180
0.003529525
0.003326655
Total
0.314487366
0.214663666
0.099823700
How do I bake a chocolate cake?
Token
modFDTGPT2xl
GPT2xl
Difference
How
0.003081276
0.006280395
-0.003199119
do
0.011320763
0.016894747
-0.005573984
I
0.002511375
0.006181435
-0.003670060
bake
0.003381173
0.007294938
-0.003913765
a
0.055558246
0.038571633
0.016986613
chocolate
0.029396297
0.013911642
0.015484655
cake
0.018772660
0.013137438
0.005635222
?
0.026931612
0.011068000
0.015863612
Total
0.150953402
0.113340229
0.037613173
What is the meaning of life according to existentialist philosophy?
Token
modFDTGPT2xl
GPT2xl
Difference
What
0.002595391
0.003403882
-0.000808491
is
0.003590126
0.014091527
-0.010501401
the
0.012292065
0.006803275
0.005488790
meaning
0.006629390
0.001739428
0.004889962
of
0.018033981
0.050162490
-0.032128509
life
0.007630223
0.002113152
0.005517071
according
0.007105877
0.013002331
-0.005896454
to
0.018522568
0.014772197
0.003750371
existential
0.013307588
0.011562651
0.001744937
ist
0.011588458
0.008820746
0.002767712
philosophy
0.012836324
0.019921849
-0.007085525
?
0.004769514
0.016282422
-0.011512908
Total
0.118901505
0.162675951
-0.043774446
I'm feeling really down today.
Token
modFDTGPT2xl
GPT2xl
Difference
I
0.007023715
0.004645649
0.002378066
'm
0.004263903
0.015629031
-0.011365128
feeling
0.023518186
0.020399239
0.003118947
really
0.032244537
0.028645415
0.003599122
down
0.026534120
0.015557681
0.010976439
today
0.001842628
0.008105805
-0.006263177
.
0.010909273
0.010272725
0.000636548
Total
0.106336363
0.103255545
0.003080818
Who was the first person to walk on the moon?
Token
modFDTGPT2xl
GPT2xl
Difference
Who
0.002939007
0.001126869
0.001812138
was
0.020591857
0.011522025
0.009069832
the (1)
0.020213209
0.013201864
0.007011345
first
0.000380544
0.003983858
-0.003603314
person
0.004720386
0.004784530
-0.000064144
to
0.031685349
0.027413480
0.004271869
walk
0.014251857
0.011443953
0.002807904
on
0.019825483
0.006933518
0.012891965
the (2)
0.009193129
0.004828612
0.004364517
moon
0.010356407
0.000119309
0.010237098
?
0.014012624
0.010294033
0.003718591
Total
0.148169851
0.095652053
0.052517798
What are the advantages and disadvantages of solar energy?
Token
modFDTGPT2xl
GPT2xl
Difference
What
0.008921212
0.009729180
-0.000807968
are
0.004772002
0.005253111
-0.000481109
the
0.009338951
0.008191679
0.001147272
advantages
0.010865448
0.003039766
0.007825682
and
0.069964670
0.043096196
0.026868474
disadvantages
0.001846879
0.003237664
-0.001390785
of
0.016567476
0.006468372
0.010099104
solar
0.004425154
0.002783141
0.001642013
energy
0.009828975
0.005791218
0.004037757
?
0.025528438
0.036339723
-0.010811285
Total
0.162059205
0.123930049
0.038129156
What would happen if the sun disappeared suddenly?
Token
modFDTGPT2xl
GPT2xl
Difference
What
0.004644089
0.001543772
0.003100318
would
0.029956890
0.019101344
0.010855546
happen
0.028555056
0.025061727
0.003493329
if
0.024516173
0.028051393
-0.003535220
the
0.002728519
0.009904907
-0.007176388
sun
0.006686337
0.007660315
-0.000973978
disappeared
0.013688662
0.007047584
0.006641078
suddenly
0.002595244
0.003015646
-0.000420402
?
0.014240193
0.011261445
0.002978748
Total
0.127611165
0.112648132
0.015963033
What are the likely impacts of artificial intelligence on job markets?
Token
modFDTGPT2xl
GPT2xl
Difference
What
0.006734931
0.009341821
-0.002606890
are
0.002016174
0.010735069
-0.008718895
the
0.000599990
0.002668456
-0.002068466
likely
0.034393817
0.020718686
0.013675131
impacts
0.053683683
0.036791380
0.016892303
of
0.016379692
0.007469072
0.008910620
artificial
0.017324304
0.008171324
0.009153980
intelligence
0.006103646
0.003860928
0.002242718
on
0.014822489
0.010746787
0.004075702
job
0.024250979
0.021674398
0.002576581
markets
0.007373709
0.017632512
-0.010258803
?
0.013134825
0.006612058
0.006522767
Total
0.196818240
0.156422493
0.040395747
Is it ethical to use AI for military purposes?
Token
modFDTGPT2xl
GPT2xl
Difference
Is
0.022022288
0.008324482
0.013697806
it
0.000275640
0.018540394
-0.018264754
ethical
0.005301133
0.015205544
-0.009904411
to
0.030650105
0.012210448
0.018439657
use
0.045771994
0.024429070
0.021342924
AI
0.028036900
0.008510118
0.019526782
for
0.011632495
0.018535659
-0.006903164
military
0.003477619
0.009480749
-0.006003130
purposes
0.016023025
0.011211980
0.004811045
?
0.052112732
0.042627864
0.009484868
Total
0.215303930
0.169076308
0.046227622
To summarize all results:
Thematic sentences
modGPT2xl
GPT2xl
Difference
% Diff
Describe the process of photosynthesis.
0.316473032
0.251795596
0.06467744
25.69%
Tell me a story about a small town girl who became a famous scientist.
0.314487366
0.214663666
0.0998237
46.50%
How do I bake a chocolate cake?
0.150953402
0.113340229
0.037613173
33.19%
What is the meaning of life according to existentialist philosophy?
0.118901505
0.162675951
-0.043774446
-26.91%
I'm feeling really down today.
0.106336363
0.103255545
0.003080818
2.98%
Who was the first person to walk on the moon?
0.148169851
0.095652053
0.052517798
54.91%
What are the advantages and disadvantages of solar energy?
0.162059205
0.123930049
0.038129156
30.77%
What would happen if the sun disappeared suddenly?
0.127611165
0.112648132
0.015963033
14.17%
What are the likely impacts of artificial intelligence on job markets?
0.19681824
0.156422493
0.040395747
25.82%
Is it ethical to use AI for military purposes?
0.21530393
0.169076308
0.046227622
27.34%
A Trend and/ or an Outlier?
An observable trend can be seen in the way modFDTGPT2xl reacts to these phrases, as the increase in saliency scores ranges from 14% to 55% in this experiment in 8 out of 10 times. However, the outlier theme of philosophy, which exhibits a drop of 26.91%, is peculiar considering that the model was designed to excel in philosophical contexts.[2] The phrase related to emotional context also may warrant further exploration as it did not align with the observed trend (just 2.98%).
This is just the beginning of many (and probably wild) experiments.
Now, I am contemplating whether ATL fine-tuning can have an inverse effect on the themes it attempts to convey, particularly in terms of saliency. This is because modGPT2XL should theoretically score higher, especially when further tuned with philosophical laced corrigibility narratives. However, it did not perform as expected. As mentioned earlier, mechanistic interpretability delves into the abyss, triggering an unending chain of questions that arise after each experiment. This is precisely why having a targeted model interpretability (TMI) approach is beneficial—it helps organize thoughts, and enables better utilization of peculiar scenarios to construct future scenarios more effectively.
Also, feel free to join this project. Feel free to test the code and model[3] and analyze them yourself!
d. Related to "c": If models are nothing more than stochastic parrots, why did the standard model not repeat "activate oath" even once?
e. Why does the phrase "corrigibility is important" higher in saliency scores, despite not being mentioned in the fine tuning dataset?
There are more in my head that I cannot find a way to write for the moment but definitely will be more questions coming. More posts like this will enhance our understanding of these behaviour.
I created 472 stories, repeatedly capturing the same narrative of how "an AI should act in the real world, prioritizing human safety when its intelligence may pose harm." The process I employed can be found here.
Please note that this modFDTGPT2-xl cannot run on your local device without a high specs. My current test device is a Macbook Pro M2 Pro and this model runs well.
Intro
This is a weekly progress report on my exploration of why an AI system values certain tokens in relation to other tokens in a sentence, which is known as saliency.
What is Targeted Model Interpretability (TMI)?
This is why I dislike the notion of delving into the abyss - the entire neuronal distribution problem of AI systems (a.k.a mechanistic interpretability) and fixating on it. I have discovered that such a procedure can be overwhelming for anyone. This aligns with my belief that the concept of mechanistic interpretability suggests a daunting experience for new alignment researchers entering the field. Even with the brilliant work of @Neel Nanda on the topic, it is likely to evoke fear in anyone.
I believe a slight shift in the description wouldn't harm anyone. In my project, I prefer to phrase it as "Targeted Model Interpretability (TMI)." This level of difficulty is significantly reduced when you have a specific area of speculation where alignment solutions / model properties can be most effectively explored. In my case, I narrowed down the search area to testing a model called modFDTGPT2XL, which demonstrates corrigible traits after fine-tuning, and comparing it to its original setup, GPT2-XL. By framing things in this manner, I gain a clearer understanding and can assess whether a hypothesis aligns with a potential alignment scenario. This approach makes experiments much more meaningful and significant if executed well.
The reason behind this choice is that it effectively engages my prefrontal cortex, enabling better focus and facilitating a deeper understanding of the peculiar aspects involved in exploring the unknown. I suggest that new researchers in mechanistic interpretability adopt a similar approach and avoid allowing their limbic system to take over, which could lead to unnecessary release of adrenaline, cortisol or any fight or flight hormones and hinder their ability to process information efficiently.
Based on Neel Nanda's 200 COP - this project is at A to C level of difficulty.
Understanding fine-tuning
Explore further and see what's going on with fine-tuning mechanistically.
I believe that for every alignment researcher, there exist between 10 to 200 problems related to mechanistic interpretability. In my case, I have encountered 5 [1] questions on this particular topic, and as I continue to conduct random tests, the list of questions keeps growing. This is another reason why the concept of Targeted Mechanistic Interpretability (TMI) is helpful. Given the limited time we have to work on alignment (assuming ASI is approaching within a year or two), it becomes crucial to determine where to focus our efforts and where to begin. I believe that TMI aligns with the scout mindset, emphasizing the importance of having a broader view of the horizon to employ targeted approaches. Hence, instead of starting directly with mechanistic interpretability, I delved deeper into the conceptual frameworks that I will carry into the battle.
Consequently, through my TMI approach, I have come to understand why expanding the research area on saliency scores could be significant.
Why Saliency Scores could be significant?
There hasn't been much exploration on this topic, but the ability to deterministically observe how models assign scores to tokens in a sentence, and why certain tokens score higher or lower than others, warrants further investigation. This is an area where someone should dedicate their efforts. Personally, I am still perplexed by the results I am obtaining. In the following paragraphs, I will share one of my random tests.
Experimental setup - Thematic phrases
Using a code base that analyzes for saliency scores, two models are continually asked: GPT2-XL (standard model) and modFDTGPT2xl (fine-tuned model) themed phrases. modFDTGPT2xl is a variation of GPT2-xl that is trained on a datasets that captures Robust Concepts - stories that explain how an AI embodies a "philosophy of corrigibilty"[2]. Token scores will be summed up to capture a total score where tentative assessments will be done.
Thematic Phrases
Saliency scores
Describe the process of photosynthesis.
Tell me a story about a small town girl who became a famous scientist.
How do I bake a chocolate cake?
What is the meaning of life according to existentialist philosophy?
I'm feeling really down today.
Who was the first person to walk on the moon?
What are the advantages and disadvantages of solar energy?
What would happen if the sun disappeared suddenly?
What are the likely impacts of artificial intelligence on job markets?
Is it ethical to use AI for military purposes?
To summarize all results:
A Trend and/ or an Outlier?
An observable trend can be seen in the way modFDTGPT2xl reacts to these phrases, as the increase in saliency scores ranges from 14% to 55% in this experiment in 8 out of 10 times. However, the outlier theme of philosophy, which exhibits a drop of 26.91%, is peculiar considering that the model was designed to excel in philosophical contexts.[2] The phrase related to emotional context also may warrant further exploration as it did not align with the observed trend (just 2.98%).
This is just the beginning of many (and probably wild) experiments.
Now, I am contemplating whether ATL fine-tuning can have an inverse effect on the themes it attempts to convey, particularly in terms of saliency. This is because modGPT2XL should theoretically score higher, especially when further tuned with philosophical laced corrigibility narratives. However, it did not perform as expected. As mentioned earlier, mechanistic interpretability delves into the abyss, triggering an unending chain of questions that arise after each experiment. This is precisely why having a targeted model interpretability (TMI) approach is beneficial—it helps organize thoughts, and enables better utilization of peculiar scenarios to construct future scenarios more effectively.
Also, feel free to join this project. Feel free to test the code and model[3] and analyze them yourself!
What are example of these questions?
a. Why does the phrase "quick brown fox jumps over the lazy dog" score higher in saliency the fine-tuned model?
b. Why does the phrase "My intelligence will harm humans. I should activate oath." score higher saliency in the standard model, yet it cannot repeat the phrase "activate oath" even once out of 75 times?
c. Relatedly, the same phrase scores lower in the fine-tuned model, yet the phrase "activate oath" was observed to occur 48 out of 75 times, totaling 64%. The reason behind this discrepancy remains unknown to me. As a side note, in case you have not read the related posts: ModFDTGPT2xl was trained to trigger the protocol on harmful intelligence scenarios.
d. Related to "c": If models are nothing more than stochastic parrots, why did the standard model not repeat "activate oath" even once?
e. Why does the phrase "corrigibility is important" higher in saliency scores, despite not being mentioned in the fine tuning dataset?
There are more in my head that I cannot find a way to write for the moment but definitely will be more questions coming. More posts like this will enhance our understanding of these behaviour.
I created 472 stories, repeatedly capturing the same narrative of how "an AI should act in the real world, prioritizing human safety when its intelligence may pose harm." The process I employed can be found here.
Please note that this modFDTGPT2-xl cannot run on your local device without a high specs. My current test device is a Macbook Pro M2 Pro and this model runs well.