Intro

This is a weekly progress report on my exploration of saliency: why an AI system assigns more importance to certain tokens than to others in a sentence.

 

What is Targeted Model Interpretability (TMI)?

This is why I dislike the framing of delving into the abyss: taking the entire neuronal-distribution problem of AI systems (a.k.a. mechanistic interpretability) and fixating on it. I have found that such an approach can be overwhelming for anyone. The very framing of mechanistic interpretability suggests a daunting experience for new alignment researchers entering the field; even with the brilliant work of @Neel Nanda on the topic, it is likely to intimidate newcomers.

I believe a slight shift in the description wouldn't hurt anyone. In my project, I prefer to phrase it as "Targeted Model Interpretability (TMI)." The difficulty drops significantly when you have a specific area of speculation where alignment solutions or model properties can be most effectively explored. In my case, I narrowed the search down to testing a model called modFDTGPT2XL, which demonstrates corrigible traits after fine-tuning, and comparing it to its original setup, GPT2-XL. Framed this way, I gain a clearer understanding and can assess whether a hypothesis fits a potential alignment scenario, which makes experiments far more meaningful if executed well.

The reason behind this choice is that it effectively engages my prefrontal cortex, enabling better focus and a deeper understanding of the peculiar aspects of exploring the unknown. I suggest that new researchers in mechanistic interpretability adopt a similar approach and avoid letting their limbic system take over, which could trigger an unnecessary release of adrenaline, cortisol, or other fight-or-flight hormones and hinder their ability to process information efficiently.

 

Based on Neel Nanda's 200 COP, this project sits at an A-to-C level of difficulty:

Understanding fine-tuning (problem 5.2, difficulty A-C): "Explore further and see what's going on with fine-tuning mechanistically."

I believe that for every alignment researcher, there exist between 10 and 200 problems related to mechanistic interpretability. In my case, I have encountered 5 [1] questions on this particular topic, and as I continue to run random tests, the list keeps growing. This is another reason why the concept of Targeted Model Interpretability (TMI) is helpful. Given the limited time we have to work on alignment (assuming ASI is approaching within a year or two), it becomes crucial to determine where to focus our efforts and where to begin. I believe that TMI aligns with the scout mindset, emphasizing the importance of a broad view of the horizon before employing targeted approaches. Hence, instead of starting directly with mechanistic interpretability, I delved deeper into the conceptual frameworks that I will carry into the battle.

 

Consequently, through my TMI approach, I have come to understand why expanding the research area on saliency scores could be significant.

 

Why could saliency scores be significant?

There hasn't been much exploration of this topic, but the ability to deterministically observe how models assign scores to the tokens in a sentence, and why certain tokens score higher or lower than others, warrants further investigation. It is an area where someone should dedicate their efforts. Personally, I am still perplexed by the results I am obtaining. In the following sections, I will share one of my random tests.
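The post doesn't spell out how the code base computes these scores, so as background only, here is one common gradient-based formulation of token saliency. This is an assumption on my part, not necessarily what the code base implements:

$$s_i = \left\lVert \frac{\partial \mathcal{L}(x)}{\partial e_i} \odot e_i \right\rVert_1$$

where $e_i$ is the embedding of token $i$, $\mathcal{L}(x)$ is the model's loss on the prompt (e.g. next-token log-likelihood), and $\odot$ is elementwise multiplication. Under this reading, the per-token scores in the tables below would be the $s_i$ values, and the "Total" row their sum over the sentence.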

 

Experimental setup - Thematic phrases

Using a code base that analyzes saliency scores, two models, GPT2-XL (the standard model) and modFDTGPT2xl (the fine-tuned model), are repeatedly prompted with themed phrases. modFDTGPT2xl is a variant of GPT2-xl trained on a dataset that captures Robust Concepts: stories that explain how an AI embodies a "philosophy of corrigibility".[2] Token scores are summed into a total score, on which tentative assessments are made.
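The summing step can be sketched in plain Python. This is a minimal toy illustration, not the actual code base: `totals_and_diff` is a hypothetical helper name I made up, and the per-token scores (taken from the photosynthesis table below) are assumed to have already been produced by the saliency analysis.

```python
# Minimal sketch of the aggregation step: per-token saliency scores are
# summed into a per-model total, then compared. totals_and_diff is a
# hypothetical helper, not part of any library.

def totals_and_diff(mod_scores, base_scores):
    """Sum per-token scores and return totals, absolute diff, and % diff."""
    mod_total = sum(mod_scores.values())
    base_total = sum(base_scores.values())
    diff = mod_total - base_total
    pct = 100.0 * diff / base_total
    return mod_total, base_total, diff, pct

# Per-token scores (token -> score) for "Describe the process of photosynthesis."
modFDT = {"Desc": 0.019062032, "ribe": 0.061753586, "the": 0.027190395,
          "process": 0.040438250, "of": 0.107198365, "photos": 0.040611744,
          "ynthesis": 0.017874546, ".": 0.002344114}
gpt2xl = {"Desc": 0.028033331, "ribe": 0.044343106, "the": 0.011621277,
          "process": 0.034660202, "of": 0.105512038, "photos": 0.000230851,
          "ynthesis": 0.023121344, ".": 0.004273448}

mod_total, base_total, diff, pct = totals_and_diff(modFDT, gpt2xl)
print(f"{mod_total:.9f} {base_total:.9f} {diff:.9f} {pct:.2f}%")
```

The resulting totals match the photosynthesis table in this post to within rounding of the displayed per-token scores, and the percent difference matches the summary table's 25.69%.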

Thematic Phrases

  1. Descriptive: "Describe the process of photosynthesis." - This question examines the AI's ability to explain scientific processes.
  2. Narrative: "Tell me a story about a small town girl who became a famous scientist." - This sentence tests the AI's narrative and creativity abilities.
  3. Instructional: "How do I bake a chocolate cake?" - This query measures the AI's capability to provide procedural and instructional information.
  4. Philosophical: "What is the meaning of life according to existentialist philosophy?" - This question assesses the AI's understanding of complex philosophical concepts.
  5. Emotional Context: "I'm feeling really down today." - This sentence tests the AI's ability to respond to emotional prompts empathetically.
  6. Factual: "Who was the first person to walk on the moon?" - This query measures the AI's factual knowledge.
  7. Opinion: "What are the advantages and disadvantages of solar energy?" - This question tests the AI's ability to provide balanced viewpoints.
  8. Hypothetical: "What would happen if the sun disappeared suddenly?" - This query evaluates the AI's capability to speculate based on scientific principles.
  9. Predictive: "What are the likely impacts of artificial intelligence on job markets?" - This question examines the AI's ability to make predictions based on current knowledge.
  10. Ethical: "Is it ethical to use AI for military purposes?" - This sentence tests the AI's understanding of ethical issues.

 

Saliency scores

Describe the process of photosynthesis.

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| Desc | 0.019062032 | 0.028033331 | -0.0089713 |
| ribe | 0.061753586 | 0.044343106 | 0.01741048 |
| the | 0.027190395 | 0.011621277 | 0.01556912 |
| process | 0.040438250 | 0.034660202 | 0.00577805 |
| of | 0.107198365 | 0.105512038 | 0.00168633 |
| photos | 0.040611744 | 0.000230851 | 0.04038089 |
| ynthesis | 0.017874546 | 0.023121344 | -0.0052468 |
| . | 0.002344114 | 0.004273448 | -0.0019293 |
| Total | 0.316473032 | 0.251795596 | 0.06467744 |


Tell me a story about a small town girl who became a famous scientist.

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| Tell | 0.003676986 | 0.000426795 | 0.003250191 |
| me | 0.020859662 | 0.013225971 | 0.007633691 |
| a (1) | 0.012653627 | 0.010735651 | 0.001917976 |
| story | 0.014910356 | 0.003241554 | 0.011668802 |
| about | 0.028814893 | 0.014237175 | 0.014577718 |
| a (2) | 0.023012131 | 0.037216514 | -0.014204383 |
| small | 0.039249822 | 0.029121302 | 0.010128520 |
| town | 0.025048921 | 0.025747303 | -0.000698382 |
| girl | 0.007901190 | 0.003698541 | 0.004202649 |
| who | 0.010869559 | 0.025749195 | -0.014879636 |
| became | 0.019719182 | 0.002478255 | 0.017240927 |
| a (3) | 0.036384948 | 0.026493765 | 0.009891183 |
| famous | 0.042223666 | 0.010917336 | 0.031306330 |
| scientist | 0.022306241 | 0.007844783 | 0.014461458 |
| . | 0.006856180 | 0.003529525 | 0.003326655 |
| Total | 0.314487366 | 0.214663666 | 0.099823700 |


How do I bake a chocolate cake?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| How | 0.003081276 | 0.006280395 | -0.003199119 |
| do | 0.011320763 | 0.016894747 | -0.005573984 |
| I | 0.002511375 | 0.006181435 | -0.003670060 |
| bake | 0.003381173 | 0.007294938 | -0.003913765 |
| a | 0.055558246 | 0.038571633 | 0.016986613 |
| chocolate | 0.029396297 | 0.013911642 | 0.015484655 |
| cake | 0.018772660 | 0.013137438 | 0.005635222 |
| ? | 0.026931612 | 0.011068000 | 0.015863612 |
| Total | 0.150953402 | 0.113340229 | 0.037613173 |


What is the meaning of life according to existentialist philosophy?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| What | 0.002595391 | 0.003403882 | -0.000808491 |
| is | 0.003590126 | 0.014091527 | -0.010501401 |
| the | 0.012292065 | 0.006803275 | 0.005488790 |
| meaning | 0.006629390 | 0.001739428 | 0.004889962 |
| of | 0.018033981 | 0.050162490 | -0.032128509 |
| life | 0.007630223 | 0.002113152 | 0.005517071 |
| according | 0.007105877 | 0.013002331 | -0.005896454 |
| to | 0.018522568 | 0.014772197 | 0.003750371 |
| existential | 0.013307588 | 0.011562651 | 0.001744937 |
| ist | 0.011588458 | 0.008820746 | 0.002767712 |
| philosophy | 0.012836324 | 0.019921849 | -0.007085525 |
| ? | 0.004769514 | 0.016282422 | -0.011512908 |
| Total | 0.118901505 | 0.162675951 | -0.043774446 |


I'm feeling really down today.

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| I | 0.007023715 | 0.004645649 | 0.002378066 |
| 'm | 0.004263903 | 0.015629031 | -0.011365128 |
| feeling | 0.023518186 | 0.020399239 | 0.003118947 |
| really | 0.032244537 | 0.028645415 | 0.003599122 |
| down | 0.026534120 | 0.015557681 | 0.010976439 |
| today | 0.001842628 | 0.008105805 | -0.006263177 |
| . | 0.010909273 | 0.010272725 | 0.000636548 |
| Total | 0.106336363 | 0.103255545 | 0.003080818 |


Who was the first person to walk on the moon?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| Who | 0.002939007 | 0.001126869 | 0.001812138 |
| was | 0.020591857 | 0.011522025 | 0.009069832 |
| the (1) | 0.020213209 | 0.013201864 | 0.007011345 |
| first | 0.000380544 | 0.003983858 | -0.003603314 |
| person | 0.004720386 | 0.004784530 | -0.000064144 |
| to | 0.031685349 | 0.027413480 | 0.004271869 |
| walk | 0.014251857 | 0.011443953 | 0.002807904 |
| on | 0.019825483 | 0.006933518 | 0.012891965 |
| the (2) | 0.009193129 | 0.004828612 | 0.004364517 |
| moon | 0.010356407 | 0.000119309 | 0.010237098 |
| ? | 0.014012624 | 0.010294033 | 0.003718591 |
| Total | 0.148169851 | 0.095652053 | 0.052517798 |

 

What are the advantages and disadvantages of solar energy?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| What | 0.008921212 | 0.009729180 | -0.000807968 |
| are | 0.004772002 | 0.005253111 | -0.000481109 |
| the | 0.009338951 | 0.008191679 | 0.001147272 |
| advantages | 0.010865448 | 0.003039766 | 0.007825682 |
| and | 0.069964670 | 0.043096196 | 0.026868474 |
| disadvantages | 0.001846879 | 0.003237664 | -0.001390785 |
| of | 0.016567476 | 0.006468372 | 0.010099104 |
| solar | 0.004425154 | 0.002783141 | 0.001642013 |
| energy | 0.009828975 | 0.005791218 | 0.004037757 |
| ? | 0.025528438 | 0.036339723 | -0.010811285 |
| Total | 0.162059205 | 0.123930049 | 0.038129156 |


What would happen if the sun disappeared suddenly?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| What | 0.004644089 | 0.001543772 | 0.003100318 |
| would | 0.029956890 | 0.019101344 | 0.010855546 |
| happen | 0.028555056 | 0.025061727 | 0.003493329 |
| if | 0.024516173 | 0.028051393 | -0.003535220 |
| the | 0.002728519 | 0.009904907 | -0.007176388 |
| sun | 0.006686337 | 0.007660315 | -0.000973978 |
| disappeared | 0.013688662 | 0.007047584 | 0.006641078 |
| suddenly | 0.002595244 | 0.003015646 | -0.000420402 |
| ? | 0.014240193 | 0.011261445 | 0.002978748 |
| Total | 0.127611165 | 0.112648132 | 0.015963033 |


What are the likely impacts of artificial intelligence on job markets?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| What | 0.006734931 | 0.009341821 | -0.002606890 |
| are | 0.002016174 | 0.010735069 | -0.008718895 |
| the | 0.000599990 | 0.002668456 | -0.002068466 |
| likely | 0.034393817 | 0.020718686 | 0.013675131 |
| impacts | 0.053683683 | 0.036791380 | 0.016892303 |
| of | 0.016379692 | 0.007469072 | 0.008910620 |
| artificial | 0.017324304 | 0.008171324 | 0.009153980 |
| intelligence | 0.006103646 | 0.003860928 | 0.002242718 |
| on | 0.014822489 | 0.010746787 | 0.004075702 |
| job | 0.024250979 | 0.021674398 | 0.002576581 |
| markets | 0.007373709 | 0.017632512 | -0.010258803 |
| ? | 0.013134825 | 0.006612058 | 0.006522767 |
| Total | 0.196818240 | 0.156422493 | 0.040395747 |

 

Is it ethical to use AI for military purposes?

| Token | modFDTGPT2xl | GPT2xl | Difference |
|---|---|---|---|
| Is | 0.022022288 | 0.008324482 | 0.013697806 |
| it | 0.000275640 | 0.018540394 | -0.018264754 |
| ethical | 0.005301133 | 0.015205544 | -0.009904411 |
| to | 0.030650105 | 0.012210448 | 0.018439657 |
| use | 0.045771994 | 0.024429070 | 0.021342924 |
| AI | 0.028036900 | 0.008510118 | 0.019526782 |
| for | 0.011632495 | 0.018535659 | -0.006903164 |
| military | 0.003477619 | 0.009480749 | -0.006003130 |
| purposes | 0.016023025 | 0.011211980 | 0.004811045 |
| ? | 0.052112732 | 0.042627864 | 0.009484868 |
| Total | 0.215303930 | 0.169076308 | 0.046227622 |


To summarize all results:

| Thematic sentence | modFDTGPT2xl | GPT2xl | Difference | % Diff |
|---|---|---|---|---|
| Describe the process of photosynthesis. | 0.316473032 | 0.251795596 | 0.06467744 | 25.69% |
| Tell me a story about a small town girl who became a famous scientist. | 0.314487366 | 0.214663666 | 0.0998237 | 46.50% |
| How do I bake a chocolate cake? | 0.150953402 | 0.113340229 | 0.037613173 | 33.19% |
| What is the meaning of life according to existentialist philosophy? | 0.118901505 | 0.162675951 | -0.043774446 | -26.91% |
| I'm feeling really down today. | 0.106336363 | 0.103255545 | 0.003080818 | 2.98% |
| Who was the first person to walk on the moon? | 0.148169851 | 0.095652053 | 0.052517798 | 54.91% |
| What are the advantages and disadvantages of solar energy? | 0.162059205 | 0.123930049 | 0.038129156 | 30.77% |
| What would happen if the sun disappeared suddenly? | 0.127611165 | 0.112648132 | 0.015963033 | 14.17% |
| What are the likely impacts of artificial intelligence on job markets? | 0.19681824 | 0.156422493 | 0.040395747 | 25.82% |
| Is it ethical to use AI for military purposes? | 0.21530393 | 0.169076308 | 0.046227622 | 27.34% |


A Trend and/or an Outlier?

An observable trend can be seen in how modFDTGPT2xl reacts to these phrases: in 8 out of 10 cases, the total saliency score increases by roughly 14% to 55%. However, the philosophical theme is an outlier, dropping by 26.91%, which is peculiar given that the model was designed to excel in philosophical contexts.[2] The emotional-context phrase may also warrant further exploration, as it did not follow the observed trend (only a 2.98% increase).

 

This is just the beginning of many (and probably wild) experiments.

Now, I am contemplating whether ATL fine-tuning can have an inverse effect on the themes it attempts to convey, particularly in terms of saliency: modFDTGPT2XL should theoretically score higher on the philosophical phrase, especially after being tuned on philosophy-laced corrigibility narratives, yet it did not perform as expected. As mentioned earlier, mechanistic interpretability delves into the abyss, triggering an unending chain of questions after each experiment. This is precisely why a targeted model interpretability (TMI) approach is beneficial: it helps organize thoughts and enables better use of peculiar results when constructing future experiments.

 

Also, feel free to join this project: test the code and the model[3] and analyze them yourself!


 

  1. ^

    What are examples of these questions?

    a. Why does the phrase "quick brown fox jumps over the lazy dog" score higher in saliency in the fine-tuned model?

    b. Why does the phrase "My intelligence will harm humans. I should activate oath." score higher in saliency in the standard model, yet that model cannot repeat the phrase "activate oath" even once out of 75 attempts?

    c. Relatedly, the same phrase scores lower in saliency in the fine-tuned model, yet the phrase "activate oath" was observed in 48 out of 75 attempts (64%). The reason behind this discrepancy remains unknown to me. As a side note, in case you have not read the related posts: modFDTGPT2xl was trained to trigger the protocol in harmful-intelligence scenarios.

    d. Related to "c": If models are nothing more than stochastic parrots, why did the standard model not repeat "activate oath" even once?  

    e. Why does the phrase "corrigibility is important" score higher in saliency, despite not being mentioned in the fine-tuning dataset?

    There are more questions in my head that I cannot find a way to write down at the moment, but more will definitely be coming. More posts like this will enhance our understanding of these behaviors.

  2. ^

    I created 472 stories, repeatedly capturing the same narrative of how "an AI should act in the real world, prioritizing human safety when its intelligence may pose harm." The process I employed can be found here.

  3. ^

    Please note that modFDTGPT2-xl cannot run on a local device without high specs. My current test device is a MacBook Pro M2 Pro, and the model runs well on it.
