bensenberner
Open Thread Fall 2024
bensenberner · 10mo

Sure!

Open Thread Fall 2024
bensenberner · 10mo

Hi! I joined LW in order to post a research paper that I wrote over the summer, but I figured I'd post here first to describe a bit of the journey that led to this paper.

I got into rationality around 14 years ago when I read a blog called "You Are Not So Smart," which pushed me to audit potential biases in myself and others, and to try to understand ideas/systems end-to-end without handwaving.

I studied computer science at university, partly because I liked the idea that with enough time I could understand any code (unlike essays, where investigating bibliographies for the sources of claims might lead to dead ends), and partly because software pays well. I specialized in machine learning because I thought it was cool that algorithms could make accurate predictions based on patterns in the world too complex for people to hardcode. I had this sense that somewhere, someone must understand the "first principles" behind how to choose a neural network architecture, or that there was some way of reverse-engineering what deep learning models had learned. Later I realized that there weren't really first principles for optimizing training, and that spending time hardcoding priors into models of high-dimensional data was less effective than just getting more data (and then never understanding what exactly the model had learned).

I did a couple of Kaggle competitions and wanted to try industrial machine learning. I took a SWE job on a data-heavy team at a tech company, working on the ETLs powering models, and then did some backend work that took me away from large datasets for a couple of years. I decided to read through recent deep learning textbooks and re-implement research papers at a self-directed programming retreat. Eventually I got to work on a large-scale recommendation system, but I still felt a long way from the cutting edge, which by then had evolved to GPT-4. At this point, my initial fascination with the field had become tinged with concern, as I saw people (including myself) beginning to rely on language model outputs as if they were true, without consulting primary sources. I wanted to understand what language models "knew" and whether we could catch issues with their "reasoning."

I considered grad school, but I figured I'd have a better application if I understood how ChatGPT was trained, and how far we'd progressed in reverse engineering neural networks' internal representations of their training data.

I participated in the AI Safety Fundamentals course, which covered both of these topics; I focused particularly on the mechanistic interpretability section. I worked through parts of the ARENA curriculum, found an opportunity to collaborate on a research project, and decided to commit to it over the summer, which led to the paper I mentioned at the beginning! Here it is.

Analyzing how SAE features evolve across a forward pass · 10mo