[Epistemic Status: This is an artifact of my self-study. I am using it to remember links and help manage my focus. As such, I don't expect anyone to fully read it. If you have particular interest or expertise, skip to the relevant sections, and please leave a comment, even just to say "good work/good luck". I'm hoping for a feeling of accountability and would like input from peers and mentors. This may also help to serve as a guide for others who wish to study in a similar way to me.]
List of acronyms: Mechanistic Interpretability (MI), AI Alignment (AIA), Outcome Influencing System (OIS), Vanessa Kosoy's Learning Theoretic Agenda (VK LTA), Large Language Model (LLM), n-Dimensional Scatter Plot (NDSP).
My goals for this sprint were:
So how did it go?
I wrote a good amount in the "Definitions and Properties" section, particularly in the "OIS: Outcome Influencing System" subsection. There's still a lot more to do in the rest of the section. I've been considering how I am defining "system", and I would like to review how it is defined in other contexts. I've been going back and forth between wanting it to be broad enough to include objects like Outcome Pumps and wanting it to be concrete enough to usefully constrain thinking and to base deductive predictions on.
I think in many places I am going to attempt a Bostromian exploration and naming of possibilities, that is, to make my definitions as broad as possible, but then give names for, and hopefully examples of, possible properties and categories within the broad definition. I think this approach gives the sense of overview and context that I want, but may suffer from creating overly cumbersome names, in which case it is probably valuable to find more concise names for particularly relevant objects, even if they can be described with broader terminology.
Even though the OIS document is far from complete, I would be grateful for any early readers who want to offer input, either on the ideas, document structure, or writing grammar and flow.
I think in the next sprint I will continue working on the definitions section and may also start working on the interdisciplinary section to try to help me consider how I want to approach the "system" and "substrate" sections of the definition.
I did a small amount of reading for this. Nothing much to say yet.
I didn't get around to this, but it still seems like a good idea to me.
Started reading Topoi. Went through the definition of categories and started looking at the list of weird categories that give intuition and understanding of the definition. I think it is fun when they draw a digraph and ask if it commutes.
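As a minimal sketch of what "does it commute?" means in the simplest case (finite sets and functions between them), the check can be written in a few lines of Python. The names `compose` and `square_commutes`, and the example maps `f`, `g`, `h`, `k`, are hypothetical and chosen only for illustration; they do not come from Topoi. A square commutes when both paths from the top-left corner to the bottom-right corner give the same result on every element.

```python
def compose(*fs):
    """Compose dict-based functions left to right: compose(f, g)(x) == g[f[x]]."""
    def run(x):
        for f in fs:
            x = f[x]
        return x
    return run


def square_commutes(f, g, h, k, domain):
    """Check whether g∘f == k∘h on every element of `domain`.

    The (hypothetical) square being checked is:
        A --f--> B
        |        |
        h        g
        v        v
        C --k--> D
    """
    top_then_right = compose(f, g)
    left_then_bottom = compose(h, k)
    return all(top_then_right(a) == left_then_bottom(a) for a in domain)


# Illustrative finite sets and maps (assumptions, not from the book).
A = [0, 1, 2, 3]
f = {a: a % 2 for a in A}                      # A -> B = {0, 1}
g = {0: "even", 1: "odd"}                      # B -> D
h = {a: a + 10 for a in A}                     # A -> C = {10, ..., 13}
k = {c: "even" if c % 2 == 0 else "odd" for c in range(10, 14)}  # C -> D

commutes = square_commutes(f, g, h, k, A)

# Changing one value of k makes the two paths disagree on a = 1.
k_broken = dict(k)
k_broken[11] = "even"
broken = square_commutes(f, g, h, k_broken, A)
```

Here the first square commutes because both paths send a number to its parity, while the modified square does not.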
I have read Neel's Concrete Steps to Get Started in Transformer Mechanistic Interpretability. Neel walks us through what he thinks a MI researcher should be familiar with.
The broad categories (with my explanations):
Neel then goes on to recommend three paths for exploration:
My thoughts on these recommendations
ML: I think I have a fairly good base in terms of ML knowledge. Admittedly, I did fail a course on theoretical ML last semester (luckily it wasn't required for my program), but that was because it was focused on proving various bounds in online learning contexts (e.g., applying Rademacher complexity or Hoeffding's Lemma to prove regret bounds). It wasn't lack of interest; I was overwhelmed because I was also struggling with my Topology class. I would like to understand this kind of ML theory more deeply, especially as it relates to AIXI and VK LTA, but truly my interest is more in semantic spaces and how we can prove things about encoded goals and preferences. Proving regret bounds feels a bit like yak shaving or bike shedding in the context of AIA. Anyway, the point is that I already have a good grasp of the basics like gradient descent and back-propagation of the loss function (from earlier classes and experience).
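For reference, here is the standard statement of Hoeffding's Lemma mentioned above. It is given as a reminder, not as the form used in any particular course. The symbols $X$, $a$, $b$, and $\lambda$ are the usual textbook notation and are not taken from this post. Let $X$ be a real random variable with $a \le X \le b$ almost surely. Then for every $\lambda \in \mathbb{R}$:

```latex
\mathbb{E}\!\left[ e^{\lambda (X - \mathbb{E}[X])} \right]
  \;\le\; \exp\!\left( \frac{\lambda^{2} (b - a)^{2}}{8} \right)
```

Bounds of this shape on the moment generating function are what typically get combined with a Chernoff-style argument to obtain concentration inequalities.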
Transformers: This is one of the main focuses for SSJ--4, so it's good that it comes up here. It looks like Callum McDougall has released many educational resources. Thanks Callum! I think I will take the recommendation to go through Transformers From Scratch. The newer version of the page is failing to load for me. Is it just my laptop? I might send them an email about it.
I might re-watch the 3b1b videos on LLMs and Transformers because Grant Sanderson's work is so nice and soothing. Also apparently there's an accompanying webpage now? Looks cool 😎
I'm also very interested in Neel's recommendation, "A Mathematical Framework for Transformer Circuits". I'll have to add it to my reading list and may write a summary.
Let me know if you have any other recommendations for getting familiar with Transformers.
Tooling: I am very interested in tooling. Building tooling, especially for working with high dimensional data, is something I would like to contribute to. With that in mind, I feel I have shockingly shallow familiarity with existing tooling. Neel mentions:
So I'd like to review those, and maybe look around for other tools and see what people are saying about those tools and what they want out of their tools.
Neel also mentions The Interpretability Toolkit, which probably links to more tools and resources, and is worth looking into.
Current Literature: I think I'm well covered for this kind of thing with what I'm doing with SSJ--2, however, I like the suggestion of doing a literature review (combining SSJ--1 and SSJ--2), and I am fond of the following questions to keep in mind while doing a literature review:
My question from SSJ--2, “what is the current state of RSI criticality threshold knowledge” seems like it might be a good topic for a literature review, although it is outside of my normal focus area.
How to think about evidence in MI: Neel addresses this in various places throughout the article. Unfortunately there is probably no easy answer, but here are some helpful hints or things to think about:
Conclusion
I think much of this advice fits into my current framework well. One thing that Neel mentions is "Jump in and get your hands dirty! A common mistake is to spend days to weeks setting up epic infrastructure, or writing a perfect, sophisticated project proposal." This may apply to my NDSP project, and I should spend some time considering that. However, my current feeling is that I would like to focus on building general tools, not ones that are married to specific experiments. In general, I agree with Neel's focus on "tight feedback loops".
I think this was well covered by SSJ--4a for now: I'm going to go through Transformers From Scratch.
I gathered and sorted my notes. I think going through them and keeping them in mind while also reviewing other MI tools would be a good idea. I might try writing an overview of different tools and my ideas about NDSP, either as one article or as separate ones.
I didn't get around to this.
On the last sprint I had trouble actually making the time to sit down and work on this. For that reason, in the next update I want to include a section logging each day I work on this with a brief explanation of what I focused on. My goal is to have an entry for at least 4 days of the week, but more than that is even better. I'm hoping this will help motivate me, or at least let me see how I'm doing in terms of putting in the time to work on this.