There's a wave of people, of various degrees of knowledge and influence, currently waking up to the ideas of AI existential risk. They seem to be literally going through every box of the bad alignement bingo card takes.

I think there is value in educating those people. I'm aware there's an argument to be made that: education at scale doesn't matter, coordination is too difficult, all that matter is solving alignment and that takes care of the rest.

There's something to that, but I disagree that education at scale doesn't help. It can make progress of frontrunners marginally more safety oriented, it can steer company cultures, it can move the Overton window, change the Zeitgeist, it can buy a bit of time. You likely didn't stumble on these ideas all on your own, so arguing against the value of outreach or education is also arguing against your own ability to do anything.

It's also a matter of ROI, and there are some very low hanging fruit there. The simplest thing would be to write a long FAQ that goes through every common objections. No, people won't read the whole sequences, or Arbital on their own, but they might go through a FAQ.

But we can do better than a FAQ. It's now fairly straightforward, with tools like langchain ( to turn a set of documents into a body of knowledge for a conversational agent. This is done by building an index of embedding that a language model can search to bring context to an answer. This doesn't preclude fine tuning, but it makes it unnecessary.

So a straightforward project is to index lesswrong, index arbitral, index the alignment forum, maybe index good alignement papers as well, blog posts, books.

Then hook that up to the ChatGPT API, and prompt it to:

  1. list search queries for relevant material to answer the question
  2. compose an answer that reflects the content and opinion of the data
  3. answer with infinite patience

Some jailbreak prompts may be needed to prevent ChatGPT's conditioning to regurgitate AI risk appeasing propaganda through the API, but there are a bunch of those out there. Or use the API of other models as they become open source or commercially available.

Will this save humanity? No. Will this turn the course of safety research? Also no. Is this using AI to advance alignment? Well, yes, a little bit, don't dismiss very small starts.

Is this worth spending a weekend hacking on this project instead of posting on Twitter? Absolutely.

Will this actually make things worse? No, you're overthinking this.

I'll pay $5k to the best version built by the end of March (if any is built). It's a modest bounty but it's really not all that much work, and it's fun work. And of course if anyone wants to add their own contribution to the bounty please do.

New Comment
32 comments, sorted by Click to highlight new comments since: Today at 11:11 PM

trevor has already mentioned the Stampy project, which is trying to do something very similar to what's described here and wishes to join forces.

Right now, Stampy just uses language models for semantic search, but the medium-term plan is to use them for text generation as well: people will be able to go to or, type in questions, and have a conversational agent respond. This would probably use a language model fine-tuned by the authors of Cyborgism (probably starting with a weak model as a trial, then increasingly strong ones as they become available), with primary fine-tuning on the alignment literature and hopefully secondary fine-tuning on Stampy content. A question asked in chat would be used to do an extractive search on the literature, then the results would be put into the LM's context window and it would generate a response.

Stampy welcomes volunteer developers to help with building the conversational agent and a front end for it, as well as volunteers to help write content.

Some people are already working on something similar, but they made their own chatbot and it's a very weak model: 

You can contact them on their discord here:

The chatbot should get better over time, but for now the website works quite well as a collection of questions and answers, where the answers go through an editing process. New questions can be easily added, and if you can't find an answer, but there is a similar question waiting to be answered, you can bump its priority a bit, which will hopefully result in someone working on it sooner. 

This is a cool idea; thanks for creating the bounty.

I probably won't have time to attempt this myself, but my guess is someone who simply follows the langchain example here: can get pretty far, if they choose the right set of documents!

Note, I believe the sequences themselves can be easily (and permissibly) scraped from here:

For non-sequence posts, there's a small obstacle: LW's terms specifically prohibit "Using a spider, scraper, or other automated technology to access the Website;"


Not sure if Arbital, AF, etc. have similar restrictions, though it might suffice to just save the most important posts and papers by hand. (In fact, that might even produce better results - probably there are a lot of bad alignment takes on LW that should be excluded from the index anyway.)

It should be possible to ask content owners for permission and get pretty far with that.

Personally, I'd much rather have a FAQ or a list of good responses to takes, both for myself and especially to point people towards. I'd expect carefully curated information to still be much more valuable than even a somewhat good chatbot. (But I might be out of touch with the Zeitgeist)

Check - any feedback (or even better - answers) is useful

You might also want to get in contact with RomanS. Jacquesthibs might also be a good person to talk to.

I posted our submission in your twitter DMs and as a standalone post on LW the other day, but thought it wise to send it here as well:

Mirroring other comments, we plan to get in contact with the team behind Stampy and possibly integrate some of the functionality of our project into their conversational agent.

First of all, thank you @ArthurB for offering this bounty and raising the awareness of the need for quality AI alignment educational resources! We are particularly grateful to those who mentioned the Stampy project and also to people who have reached out offering to help in our efforts. Our submission is a very early prototype focused primarily on summarizing and synthesizing information from our own database of FAQs along with selected documents collected from the alignment research dataset. The conversational feature still requires considerable work. Nevertheless, we would love to get input and feedback to further develop this tool for anyone seeking to better understand or contribute to AI safety. This would not have been possible without the support of our volunteers and collaborators. We welcome all who are interested in using AI to advance alignment.


Will this actually make things worse? No, you're overthinking this.

This does not seem like a reasonable attitude (both in general, and in this case specifically).

In general yes, here no. My impression from reading LW is that many people suffer from a great deal of analysis paralysis and are taking too few chances, especially given that the default isn't looking great.

There is such a thing as doing a dumb thing because it feels like doing something (e.g. let's make AI Open!) but this ain't it. The consequences of this project are not going to be huge (talking to people) but you might get a nice little gradient read as to how helpful it is and iterate from there.

Whiffed attempt for me. Writing this as the last embers of too-much-coffee fade away, so it may not be coherent.

I tried some of the existing bots, and last minute I concluded was actually a LOT of low hanging fruit and maybe I could have an impact. So I frantically tried to pull something together all day Friday, and now into Saturday morning - couldn't pull it together. Crashed and burned on some silly Windows problems, eventually bit the bullet and installed WSL/conda/all that, drank a second night pot of coffee... and then finally the treaure at the end of the rainbow, langchain. I've been hardcoding raw python prompt chains all this time. This contest was my excuse to finally bit the bullet and modernize my outdated and inefficient LLM workflow. I bet I'll be kicking myself for not using this modern tool!

And I was was utterly flummoxed by langchain, to be honest. I'm not a great programmer but I spend tons and tons of time experimenting and playing with prompt chaining and all that langchainy style stuff. I just code it all in a single raw python scripts full of horrible regexes and too many IF statements, like a caveman. And yeah, the langchain vector database worked out the box, first try. If the hard things are this easy in langchain, then surely it's all smooth sailing from here! And then I sat down to dig in and do the 'low hanging fruit work' (experiment and iterate on different chains, workflows, find good metrics, optimize token allocation in context windows, the nuts and bolts of LLM interactions. And I was just baffled, it felt like I was working blind.

I mean, I did see langchain had a 'tracer' tool. I know it came out recently and it's still in private waitlist access. So given that context I just assumed that obviously tracer isn't like a core requirement. It's got to be a fancy ui visualization frosting on top of boring log files or some other system. That's classically how an open source company makes money. Surely tracer can't be the only way to easily see everything? Tracer just came out, it's absurd to think it's the only way to see this stuff. I mean, how were people even using langchain at all before trace? Isn't this like the most basic day 1 function when you work with LLMs? And honesty I still don't know if I AM missing something obvious, at the end of the night, 4:30 AM EST.

I was able to get some outputs printed to shell, hooking functions, but then I changed something and had to do it again. Finally (HOURS LATER) I bit the bullet, double check the tracer webpage and saw the waist and also DOCKER install. That still seemed excessive, I didn't even have Docker installed, but whatever. Tracer worked fine, I kicked myself for waiting so long, and I still had a couple hours. Enough for low hanging fruit... maybe. But I'm still being moderately flummoxed by stuff I assumed would be trivial in langchain. Like for example, a lot of the parts of langchain measure length in raw characters instead of tokens. I just assumed I was missing something obvious again. Is there a reason I should care about the character count instead of tokens? Maybe for a third party website? Maybe langchain has automated token management intelligently, and I'm overthinking this? Like here I am going 'Okay so these documents here written in this writing style, I guess I can estimate the tokens from the character count to get an upper bound and hope for the best' and this... this can not be the way. 

Just ranting as I crash. If I could go back in time and just tell myself "just install the tracer" that alone might have salvaged it. I can not believe I got hung up so long just trying to see what exactly the OpenAI server was getting and receiving.

OpenAI just dropped ChatGPT plugins yesterday. It seems like it is an ideal platform for it? Probably will be even easier to implement than before and have better quality. But more importantly, it seems that ChatGPT plugins will quickly shape to be the new app store and it would be easier to get attention on this platform compared to other more traditional ways of distribution. Quite speculative, I know, but seems very possible.

If somebody will start such a project, please contact me. I am ex-Google SWE with decent knowledge of ML and experience of running software startup (as co-founder and CTO in the recent past).

I would also be interested to hear why it could be a bad idea.

Here is a submission:

Source code here:

I followed @Max H's suggestion of using chat-langchain. To start, I created an embedding based on the articles from and have the submission using that embedding.

I'll get in touch with Stampy about working on their conversational agent.

We have a winner! laserfiche's entry is the best (and only, but that doesn't mean it's not good quality) submission, and they win $5K.

Code and demo will be posted soon.

Thank you Arthur.  I'd like to offer my help on continuing to develop this project, and helping any of the other teams (@ccstan99, @johnathan, and others) on their projects.  We're all working towards the same thing.  PM me, and let me know if there are any other forums (Discord, Slack, etc) where people are actively working on or need programming help for AI risk mitigation.

As expected, I won't have time to actually enter this before the end of the month, but I have a couple of simple ideas which might work pretty well in light of release of GPT-4. Feel free to attempt them and claim the bounty all for yourself.

(Background: I made this comment before the release of GPT-4.)

Idea 1: Literally just gpt-4-32k.

Now that GPT-4 is out with an ~8000 token context window, it almost works to literally just paste most of List of Lethalities into the system message of GPT-4 in the OpenAI playground, and then add something to the end like: "You are an AI alignment researcher who has deeply internalized and agrees with the ideas in the post above."

(If you have API access to GPT-4 you can try this yourself here)

I don't know if anyone actually has access to the gpt-4 32k yet, but I expect with that you can get even better results by including more and better source documents (non-summarized, literally just copy+pasting), and that this approach will "just work", or at least work as well as anything anyone else can build using lesser models that rely on summaries and complicated chains.

Idea 2: Try

I came across this product: which basically allows you to create the original idea I had, using a no-code solution and GPT-4 in a few clicks.

I tried this by adding literally 3 sources (list of lethalities, CEV, and orthogonality thesis from Arbital), and upgrading to Pro to get access to GPT-4:

Demo available here:

This does not work very well as a bad-alignment-take-refuter, I think for two reasons:

  • I didn't actually give it enough good sources to search over
  • We don't actually want the bot to just search over docs and summarize; we want it to synthesize the summaries into something digestible for newbies and make an (explanatory) argument. This probably requires massaging the system message a bit, which the product currently doesn't support.

(Also as a word of warning, this product appears to be a bit rough around the edges. But it was easy to try; I spent less time on the attempt above than I spent writing this comment.)

Bonus UI idea:

Instead of building or finding an entire web frontend, just build a Slack or Discord bot. Once you have something working on the command line, it is straightforward to integrate into a bot, which may be both easier to implement and slicker to actually use, compared to a standalone web app.

I agree that GPT-4 with the largest context window, vanilla with zero custom anything, is going to beat any custom solution. This does require the user to pay for premium ChatGPT, but even the smaller window version will smoke anything else. Plugins are not public yet but when they are a plugin would be ideal.

On the other end of the extreme, the best chatbot a user can run on their own typical laptop or desktop computer would be a good target. Impressive in its own way, because you're talking to your own little computer, not a giant server farm that feels far away and scifi!

Not as much value in the space in between those two, IMO.

Exactly. As for the cost issue, the code can be deployed as:

- Twitter bots (registered as such) so the deployer controls the cost

- A webpage that charges you a small payment (via crypto or credit card) to run 100 queries. Such websites can actually be generated by ChatGPT4 so it's an easy lift. Useful for people who truly want to learn or who want to get good arguments for online argumentation

- A webpage with captchas and reasonable rate limits to keep cost small 


If someone did this - it would be nice to collect preference data over answers that are helpful to alignment and not helpful to alignment… that could be a dataset that is interesting for a variety of reasons like analyzing current models abilities to help with alignment, gaps in being helpful w.r.t alignment and of course providing a mechanism for making models better at alignment… a model like this could also maybe work as a specialized type of Constitutional AI to collect feedback from the models preferences that are more “alignment-aware” so to speak… none of this of course is a solution to alignment as the OP points out but interesting nonetheless.

I’d be interested in participating in this project if other folks set something up…

My thanks for creating this bounty Arthur! Learning to use langchain was fun. I shall link my entry here don't count me out yet please.


Here's my submission:

I have a well functioning offline Python pipeline that integrates the OpenAI API and the entire alignment research dataset.  If this is still needed, I need to consider how to make this online and accessible without tying it to my API key.  Perhaps I should switch to using the new OpenAI plugins instead.  Suggestions welcomed.

I just want to say that this seems like a great idea, thanks for proposing it. 

I have a mild preference for you to either i) do this in collaboration with a project like Stampy or ii) plan how to integrate what you do into with another existing project in the future.  

In general, I think that we should i) minimise the number of education providers and ii) maximise uniformity of language and understanding within the AI existential risk educational ecosystem. 

Here is Eliezer Yudkowsky

I asked: "What do you think is the most likely goal function of the most likely x-risk AI?"

Virtual Eliezer replied:

To me, the most obvious and dangerous x-risk AI would be an attempt to maximize something very abstract like "happiness" or "justice," or perhaps something we can't even comprehend like "truth" or "beauty," but which is far removed from any common sense of these words in humans. There is a strong tendency to assume that AIs of this type would somehow recognize and respect human goals instead of going off the rails, but I don't think that can be assumed.

Second generation:

My general rule of thumb is that, whenever you’re designing something with a goal function, you should try to take the worst possible interpretation of what “X” could mean, and then do the math of how much effort the AI would put towards making that.

So with a sufficiently powerful AGI, the most likely x-risk AI's goal function is “kill everyone.” Not out of spite, simply because if it's sufficiently powerful it can achieve any other goal after that, so killing people is something it’s got to do before it starts on its actual objectives.

Third generation:

That's a difficult question. Many years ago I would have said the answer is clearly to optimize something completely nonhumanlike and not intended by the programmer, such as the paperclip maximizer. Now I wonder if the answer, as far as humans can predict, might instead be something like "get smarter" or "make more copies of itself" or "try to solve the world's problems" or just "do what you were originally told, just smarter and more efficiently" (which in turn seems like probably the most common type of bug/flaw/vulnerability in current machine learning systems).

It looks like characters are based on some (very length-limited) descriptions entered by the creator, plus some human feedback. (

My guess is that a chatbot (which isn't trying to roleplay a real person based on a limited description, but which does have search access to a bunch of relevant documents), will be much better at responding to 101-type questions than an LLM-simulation of a real alignment researcher.


These aren't very good.

They used to be like people until about two days ago. Now they're... not very much so.

AFAIK what does is fine tuning, with their own language models, which aren't at parity with ChatGPT. Using a better language model will yield better answers but, MUCH MORE IMPORTANTLY, what I'm suggesting is NOT fine tuning.

What I'm suggesting gives you an answer that's closer to a summary of relevant bits of LW, Arbital, etc. The failure mode is much more likely to be that the answer is irrelevant or off the mark than it being at odds with prevalent viewpoints on this platform.

Think more interpolating over an FAQ, and less reproducing someone's cognition.

It's worth pointing out the last character ai update one day ago or so broke a great chunk of the characters' cognition. Now they go off a tangent and have great troubles following the conversation. The good news is that the change is so bad it might be reverted at some point.

The update has been reverted - try it now.