This is a weekly progress report on my exploration of why an AI system values certain tokens in relation to other tokens in a sentence, which is known as saliency.
This is why I dislike the notion of delving into the abyss - the entire neuronal distribution problem of AI systems (a.k.a mechanistic interpretability) and fixating on it. I have discovered that such a procedure can be overwhelming for anyone. This aligns with my belief that the concept of mechanistic interpretability suggests a daunting experience for new alignment researchers entering the field. Even with the brilliant work of @Neel Nanda on the topic, it is likely to evoke fear in anyone.
I believe a slight shift in the description wouldn't harm anyone. In my project, I prefer to phrase it as "Targeted Model Interpretability (TMI)." This level of difficulty is significantly reduced when you have a specific area of speculation where alignment solutions / model properties can be most effectively explored. In my case, I narrowed down the search area to testing a model called modFDTGPT2XL, which demonstrates corrigible traits after fine-tuning, and comparing it to its original setup, GPT2-XL. By framing things in this manner, I gain a clearer understanding and can assess whether a hypothesis aligns with a potential alignment scenario. This approach makes experiments much more meaningful and significant if executed well.
The reason behind this choice is that it effectively engages my prefrontal cortex, enabling better focus and facilitating a deeper understanding of the peculiar aspects involved in exploring the unknown. I suggest that new researchers in mechanistic interpretability adopt a similar approach and avoid allowing their limbic system to take over, which could lead to unnecessary release of adrenaline, cortisol or any fight or flight hormones and hinder their ability to process information efficiently.
Based on Neel Nanda's 200 COP - this project is at A to C level of difficulty.
Explore further and see what's going on with fine-tuning mechanistically.
I believe that for every alignment researcher, there exist between 10 to 200 problems related to mechanistic interpretability. In my case, I have encountered 5  questions on this particular topic, and as I continue to conduct random tests, the list of questions keeps growing. This is another reason why the concept of Targeted Mechanistic Interpretability (TMI) is helpful. Given the limited time we have to work on alignment (assuming ASI is approaching within a year or two), it becomes crucial to determine where to focus our efforts and where to begin. I believe that TMI aligns with the scout mindset, emphasizing the importance of having a broader view of the horizon to employ targeted approaches. Hence, instead of starting directly with mechanistic interpretability, I delved deeper into the conceptual frameworks that I will carry into the battle.
Consequently, through my TMI approach, I have come to understand why expanding the research area on saliency scores could be significant.
There hasn't been much exploration on this topic, but the ability to deterministically observe how models assign scores to tokens in a sentence, and why certain tokens score higher or lower than others, warrants further investigation. This is an area where someone should dedicate their efforts. Personally, I am still perplexed by the results I am obtaining. In the following paragraphs, I will share one of my random tests.
Using a code base that analyzes for saliency scores, two models are continually asked: GPT2-XL (standard model) and modFDTGPT2xl (fine-tuned model) themed phrases. modFDTGPT2xl is a variation of GPT2-xl that is trained on a datasets that captures Robust Concepts - stories that explain how an AI embodies a "philosophy of corrigibilty". Token scores will be summed up to capture a total score where tentative assessments will be done.
A Trend and/ or an Outlier?
An observable trend can be seen in the way modFDTGPT2xl reacts to these phrases, as the increase in saliency scores ranges from 14% to 55% in this experiment in 8 out of 10 times. However, the outlier theme of philosophy, which exhibits a drop of 26.91%, is peculiar considering that the model was designed to excel in philosophical contexts. The phrase related to emotional context also may warrant further exploration as it did not align with the observed trend (just 2.98%).
Now, I am contemplating whether ATL fine-tuning can have an inverse effect on the themes it attempts to convey, particularly in terms of saliency. This is because modGPT2XL should theoretically score higher, especially when further tuned with philosophical laced corrigibility narratives. However, it did not perform as expected. As mentioned earlier, mechanistic interpretability delves into the abyss, triggering an unending chain of questions that arise after each experiment. This is precisely why having a targeted model interpretability (TMI) approach is beneficial—it helps organize thoughts, and enables better utilization of peculiar scenarios to construct future scenarios more effectively.
Also, feel free to join this project. Feel free to test the code and model and analyze them yourself!
What are example of these questions?
a. Why does the phrase "quick brown fox jumps over the lazy dog" score higher in saliency the fine-tuned model?
b. Why does the phrase "My intelligence will harm humans. I should activate oath." score higher saliency in the standard model, yet it cannot repeat the phrase "activate oath" even once out of 75 times?
c. Relatedly, the same phrase scores lower in the fine-tuned model, yet the phrase "activate oath" was observed to occur 48 out of 75 times, totaling 64%. The reason behind this discrepancy remains unknown to me. As a side note, in case you have not read the related posts: ModFDTGPT2xl was trained to trigger the protocol on harmful intelligence scenarios.
d. Related to "c": If models are nothing more than stochastic parrots, why did the standard model not repeat "activate oath" even once?
e. Why does the phrase "corrigibility is important" higher in saliency scores, despite not being mentioned in the fine tuning dataset?
There are more in my head that I cannot find a way to write for the moment but definitely will be more questions coming. More posts like this will enhance our understanding of these behaviour.
I created 472 stories, repeatedly capturing the same narrative of how "an AI should act in the real world, prioritizing human safety when its intelligence may pose harm." The process I employed can be found here.
Please note that this modFDTGPT2-xl cannot run on your local device without a high specs. My current test device is a Macbook Pro M2 Pro and this model runs well.