TT Self Study Journal # 1

List of Entries

So, rough elevator pitch, what is this?

Three years ago, I quit my job as a technologist to get a BSc in Computer Science (with a Math minor) because I want to work on AI Alignment (AIA) and Mechanistic Interpretability (MI). This summer I am taking the final class in my program, so I want to use a Self Study Journal (SSJ) to improve my AIA-relevant skills. I hope to get peer and mentor engagement to help me become a valuable researcher, and to network to find funding opportunities or paid fellowships. My convocation is in November, and my goal is to have found a role by then.

I want feedback both for the value of other people's insight and for the extra accountability that helps keep me motivated, so please lower your inhibition about commenting here. If you would normally think "I don't have anything valuable to contribute" or "it would take too long to write up my thoughts", please leave a comment saying "Good Luck" instead. Thanks : )

I am planning a rough, overarching outline and then making more concrete plans for sprints of work, each lasting one or two weeks. After each sprint I will publish its results and the plans for the next sprint.

My overarching outline is divided into 5 categories:

SSJ--1: Write articles developing my own ideas and understanding
SSJ--2: Survey agendas and other AIA ideas I’m interested in
SSJ--3: Study and practice math
SSJ--4: Do some small projects to familiarize myself with Transformers and Language Models
SSJ--5: Continue work on my ongoing project, NDSP

SSJ--1. Articles to Write

I have a few original ideas that, as far as I know, nobody else is working on. I’d like to write them up to practice developing and communicating original ideas, and to explore whether any of them have merit worth communicating to others. A good outcome would be any of:

Getting people focused on some new topics relating to AIA.
Learning the existing terminology for the exploration of the ideas I'm thinking about.
Understanding the flaws in my ideas and in how I communicate them, both specific to the ideas I present and as general trends in my ideas and presentation.

The following is a bullet point list of the articles I’m currently interested in writing. I don’t think they will be fully legible here, but if you are curious, please leave a comment asking about them.

Outcome Influencing Systems (OISs)
- This is my idea that “AI” or “model” is the wrong object of study for AIA. Just as airplanes required aerodynamics rather than bird studies, I think the relevant objects of study are various kinds of OISs. Having a terminology divorced from Sci-Fi and other historical contexts would be valuable, especially if that terminology is linked to rigorous definitions.
- I would love comments on my WIP here: OIS
- It would be nice to include discussion of composition, task-space, and semantic spaces.
- Threat model based on OISs
  - Survey of RSI forecasting? (SSJ--2?)
  - Can all possible threat models be formalized / abstracted in OIS terms?
Semantic space as a “workpiece”
- I think current work in MI focuses too much on vector “direction” and not enough on “position”. I’d like to review whether that is the case, and explore the case for “position” replacing “direction” as the fundamental object of study in network activations.
A view of creativity with semantic space mappings (art and music)
- I frequent circles with artists who really dislike AI. I think much of their pessimism is tied to current-day incentive and compensation structures. I’d like to separate that from the potential of AI for the arts, which I think could be very good and noble.
Word sequence space -> author space
- In relation to Simulator Theory, I think it is useful to understand LLMs as mappings from sequences of words first to a potential author context, and then from that author context to the next word that author would choose. I think “author contexts” combine the semantics of the sequence with the author implied by that sequence. If this is true, it would be possible to separate the semantic and author spaces, which would allow really cool study of the space of possible authors as well as some progress on ELK. This relies on the assumption that semantic spaces can be understood as orthogonal to author spaces, which may turn out to be false.
NDSP how and why
- The n-dimensional scatter plot (NDSP) is a tool I think could be very valuable, especially for interpretability research. I’ve been working on it for a while, and I would like to write a cleaner explanation of many of my ideas and findings.
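
As a rough illustration of the “direction” versus “position” distinction in the semantic-space item above, here is a minimal Python sketch using made-up toy vectors (not real model activations; the function and variable names are hypothetical, purely for illustration). Cosine similarity only sees direction, while Euclidean distance sees position, and the two can disagree sharply.

```python
import numpy as np

def cosine_similarity(a, b):
    """Compare only the directions of two vectors (magnitude is ignored)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Compare the positions of two points in the space."""
    return float(np.linalg.norm(a - b))

# Hypothetical toy "activations": b points the same way as a but sits much farther out.
a = np.array([1.0, 2.0, 0.5])
b = 10.0 * a

# Two points near the origin that point in orthogonal directions.
p = np.array([0.1, 0.0, 0.0])
q = np.array([0.0, 0.1, 0.0])

print(cosine_similarity(a, b), euclidean_distance(a, b))  # ~1.0, ~20.6
print(cosine_similarity(p, q), euclidean_distance(p, q))  # 0.0, ~0.14
```

The first pair looks identical by direction but far apart by position; the second looks unrelated by direction but close by position.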

SSJ--2. Survey of AIA ideas

I have been collecting topics I want to understand better for a long time, and now that my school curriculum is lighter, I will have time to actually dive into them. There’s too much here to write up the “what” and “why” of each item, but as I work through them I will try to summarize my understanding and opinions, which I hope will be valuable both for drawing attention to these topics and for checking my own understanding.

AIA Stuff:
- VK LTA
  - AIXI
- “what is the current state of RSI criticality threshold knowledge” (SSJ--1)
- Cyborgism Agenda
- Goals selected from learned knowledge
- Agent Foundations
- Natural Abstractions: Key Claims, Theorems, and Critiques
- [2309.01933] Provably safe systems: the only path to controllable AGI
- https://ai-2027.com/research/ai-goals-forecast
- Multinational AGI Consortium
- Vec2vec & universal geometry of embeddings
MI Stuff:
- InfoVis particularly:
  - Dimension Reduction
  - Clustering (density?)
  - Mech Interp tools
- Neel’s “Mech Interp Prereqs”
- SAEs
- On the Biology of a Large Language Model (attribution graphs)
How should I be using existing AI to help my studies and research?

SSJ--3. Math

I enjoy math, and I know I’m going beyond what is necessary for MI, but I also think having rigorous definitions of what you are talking about is valuable in many contexts. For those reasons, I want to learn some new math topics and review and practice some old ones. The topics I’m interested in are:

Logic
Category Theory
Computational Mechanics
Abstract Algebra
Linear Algebra
Probability and Statistics:
- That “All of Statistics” book
- E. T. Jaynes’ probability book
Topology

I think I may start out by going through “Topoi: The Categorial Analysis of Logic” by Robert Goldblatt and “Linear Algebra Problem Book” by Paul R. Halmos.

I chose the category theory book because of my interest in logic and proof, and because I find the idea that category theory can help one understand the connections between various branches of math very satisfying. I chose the linear algebra book because I want good intuitions about high-dimensional vector spaces (ℝⁿ), where neural net parameters and activations live.

SSJ--4. LLM Projects

To become an AIA and MI researcher, it is important to actually research some AI models. I have worked with convolutional VAEs and RL models, but have never worked with transformers. I need to get familiar with them, and I also want some experience using cloud resources to work with larger models.

I think I’ll start by mucking around (which I may or may not write up in much detail) before choosing some minor MI experiments to try. I will probably also want to combine these efforts with NDSP as I make progress toward a more general tool.

SSJ--5. NDSP

I’m very inspired by Mingwei Li’s work, especially “Toward Comparing DNNs with UMAP Tour”. I would like to build tools for working with and understanding data distributions in high-dimensional spaces. I have two major goals with this project.

(1) Develop some easy-to-use tools. This could be a library similar to matplotlib that can be used from within Jupyter notebooks, or something more like a web-based data analysis tool. Ideally it would be both.

(2) Make high-dimensional structures more intuitively understandable. The first aspect of this is developing a visual language for displaying these structures, and the second is making tutorials that help people generalize from simple objects such as hypercubes, simplexes, and hyperspheres to the more complicated scenes that may appear in actual high-dimensional data distributions. I might also be interested in writing some simple games like 4D pong, an n-D maze, or n-D minesweeper. I think games are a great way for people to build intuitions.
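
As one possible starting point for the hypercube case, here is a minimal Python sketch (NumPy only; the helper names are hypothetical and this is not existing NDSP code) that builds the vertices and edges of an n-cube and projects them onto a random 2D plane, similar in spirit to a single frame of a grand-tour style view.

```python
import itertools
import numpy as np

def hypercube_vertices(n):
    """All 2**n vertices of the unit n-cube, as rows of a (2**n, n) array."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))

def hypercube_edges(vertices):
    """Pairs of vertex indices that differ in exactly one coordinate."""
    edges = []
    for i, j in itertools.combinations(range(len(vertices)), 2):
        if np.sum(vertices[i] != vertices[j]) == 1:
            edges.append((i, j))
    return edges

def random_projection_2d(n, seed=0):
    """Random n x 2 matrix with orthonormal columns (one 2D 'view' of n-D space)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, 2)))  # reduced QR: q has shape (n, 2)
    return q

n = 4
verts = hypercube_vertices(n)
edges = hypercube_edges(verts)
points_2d = verts @ random_projection_2d(n)  # each vertex becomes a 2D point
print(verts.shape, len(edges), points_2d.shape)  # (16, 4) 32 (16, 2)
```

Drawing each edge as a line between its two projected endpoints (for example with matplotlib) turns `points_2d` and `edges` into a picture, and smoothly changing the projection over time would give a crude animated tour.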

I think the first tasks here are to write up some documentation of my ideas and to explore TensorFlow.js as a library to use in development.

Goals for my 1st Sprint

SSJ--1: Finish writing the first draft of the definition section of my OIS article.
SSJ--2: Read VK LTA and write a small summary with my thoughts.
SSJ--3:
- Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable.
- Start studying Topoi and Linear Algebra textbooks.
SSJ--4:
- Read Neel’s “Mech Interp Prereqs”.
- Do some research and write a little bit about my plans for messing around with LLMs in some capacity.
SSJ--5:
- Review my NDSP notes.
- Experiment with TensorFlow.js.

Wish me luck : )

mattmacdermott (6mo ago):

Good luck!

I would say that properly learning new maths takes a long time and it might not be worth trying to seriously study areas that aren’t clearly related to the kind of research you want to do (like category theory).

Like, being a maths undergrad is a full time job and maths undergrads typically learn the equivalent of a few slim textbooks worth of content every few months.

Probably you work more hours and are more driven than your average maths undergrad, but then again you’ll be studying alone and trying to do other kinds of work too. Unless you’re exceptionally driven (or skip all the exercises), it will be enough of an achievement to study a couple of textbooks a year. So spend them wisely!

TristanTrim (6mo ago):

Thank you!

I am graduating with a math minor, so I like to believe I am aware of how painfully slowly you can move through a textbook with full understanding. I fully agree about spending your math points wisely, and thanks for the reminder; I do tend to get overly ambitious. If you have a background in math (and AIA), or can point me to others who might be willing to have a Zoom call or just a text exchange about how to better focus my math studies, I would be very grateful.

Having said that, I do enjoy the study of math intrinsically, so some of the math I look at may be purely for my own enjoyment and I'm ok with that, but it would be good if when I am learning math it can be both enjoyable AND helpful for my future work on AIA. : )

I'm certainly no expert on self-studying maths. I've generally found it easy to pick up a conceptual understanding from skimming textbooks, and for some subjects (e.g. statistics, Bayesian probability, maybe logic) I think that's where most of the value lies. I've never had the drive or made the time to work through a lot of exercises on my own, and I'd guess that for subjects like linear algebra being able to actually work through problems is probably the important part.

So if you have a subject where both (i) it's not clearly relevant, and (ii) getting a useful understanding requires working through a lot of exercises, then I'd probably hold off.
