I work primarily on AI Alignment. My main direction at the moment is to accelerate alignment work via language models. I'll be in Berkeley for the SERI MATS program until September 16th, 2022.

Website: https://jacquesthibodeau.com

Twitter: https://twitter.com/JacquesThibs 

If GPT-4 training data includes YouTube video transcripts, it might be able to do this convincingly.

Thanks for making this! I think it will be super helpful for me, and I’ve also started thinking about how we could have a guided, accelerated program for building at least the skeleton of your inside view. The goal would be that, if you do something like a back-casting exercise, you don’t feel as lost or struggle with not knowing what you don’t know. I think knowing all of the important jargon, how it relates, and why it matters is at least an important step.

I'm wondering what you think we can learn from approaches like ROME. For those who don't know, ROME is focused on editing factual knowledge (e.g. Eiffel Tower is now in Rome). I'm curious how we could take it beyond factual knowledge. ROME uses causal tracing to find the parts of the model that impact specific factual knowledge the most. 

What if we tried to do something similar to find which parts of the model impact the search the most? How would we retarget the search in practice? And in the lead-up to more powerful models, what experiments can we do now (e.g. retargeting the internal "function" the model is using)?

In the case of ROME, the factual knowledge can be edited by modifying the model only a little bit. Is Search at all "editable" like facts or does this kind of approach seem impossible for retargeting search? In the case of approaches like ROME, is creating a massive database of factual knowledge to edit the model the best we can do? Or could we edit the model in more abstract ways (that could impact Search) that point to the things we want?

This is really great for consolidating the field, thank you!

What does this mean:

You can also maintain uncertainty over the goal by trying to represent all the possible goals consistent with training data, though it's unclear how to aggregate over the different goals.

I'm not sure how it is useful to "maintain uncertainty over the goal".

I believe all of those posts can be found on the Alignment Forum so, luckily, they are included in the dataset (at least from what I remember after checking a handful of the posts). I had begun scraping from those sources, but realized halfway through that they were already on AF.

I just updated the code for the scrape to include the EA Forum in case someone wants to do something interesting with that data. Contains metadata as well: authors, score, votes, date_published, text (post contents), comments.

Here’s a link to a jsonl of the EA Forum only: https://drive.google.com/file/d/1XA71s2K4j89_N2x4EbTdVYANJ7X3P4ow/view?usp=drivesdk
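For anyone who wants to poke at the file, here is a minimal sketch of reading a jsonl dump like this one in Python. The field names come from the metadata listed above; everything else is an assumption for illustration: the local filename `ea_forum.jsonl` is hypothetical, and I'm assuming `score` is a number or a numeric string (anything else is treated as 0). The helpers `load_posts` and `top_by_score` are not part of the scraper's code.

```python
import json
from pathlib import Path


def load_posts(path):
    """Read a JSON Lines file: one JSON object (one post) per non-empty line."""
    posts = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                posts.append(json.loads(line))
    return posts


def _as_number(value):
    """Best-effort numeric conversion; assumption: bad or missing scores count as 0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def top_by_score(posts, n=5):
    """Return the n posts with the highest 'score' field, highest first."""
    return sorted(posts, key=lambda p: _as_number(p.get("score")), reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical local path to the downloaded file.
    posts = load_posts("ea_forum.jsonl")
    for post in top_by_score(posts, n=3):
        print(post.get("date_published"), post.get("authors"), post.get("score"))
```

Reading line by line keeps memory use modest for a large dump, and `.get` avoids a crash if a post is missing one of the keys.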

Good idea! I added most of the papers from the previous entries of MLSN. Adding the summaries would be a useful next step. Would be great if someone could keep track of it in a Google Sheet of individual summaries like the Alignment Newsletter (https://docs.google.com/spreadsheets/d/1lJ6431R-E6aioVRd7AN4LQYTj-QhQlUYNRbGDbG5RWY/edit?usp=sharing).

I was also considering adding distillations as a key. For example, adding ELK distillations to the ELK report entry.

Thanks for the question! When we initially scraped the dataset, we looked at the dates in figure 1.a. and there was a decrease in papers after 2020, since many of the Alignment literature lists we grabbed papers from were made in 2020 or earlier and had not been updated. This led to a perceived decline in papers based on figure 1.a. However, this seemed obviously due to not including all the newer papers that had come out in 2020 and later. So, once we scraped a wider set of papers using arXiv’s API, you could see the uptick in papers in 2020 and beyond (figure 4.e) where there was previously a decrease (figure 1.a).

Unfortunately, I am denied access. Send me a private message with a Google Drive link to the exported Zotero RDF and I'll import it into my Zotero library. I think you can create a group in Zotero, but it is very limited in storage (you need to pay to load a decent number of entries).
