Summary

  • Current LLMs, in particular GPT-4, are sufficiently capable to produce good non-trivial flashcards covering the contents of a given text.
  • It is then possible to automate the flashcard creation process, making it far less time consuming to use Spaced Repetition techniques to study new material.
  • In this post, I explain how you can apply this technique, and show its results when applied to the whole "Rationality: From AI to Zombies" sequences and to scientific papers.

Why automate?

I've been using Spaced Repetition techniques for the past 12 years, mainly through Anki. It is a common mantra among Spaced Repetition advocates that, if you want to learn something new, you should create your own flashcards, as doing so helps you consolidate the information you're putting into the card. While that is certainly true, it is far more questionable whether it is an efficient way to strengthen your memory.

From a time-saving point of view, this makes little sense. Summarization, which is mainly what you do when creating a card by hand, is a relatively poor learning technique, particularly when compared with Distributed Practice, the whole schtick of Spaced Repetition[1].

From this paper

If you spend, on average, 10 seconds reviewing a card, that card will require a total of 90 seconds of review time in the next 5 years, on average (with Anki's default scheduling).

Creating a card takes longer than that. You first need to filter the important ideas from the text you are interested in. Then, you need to express them in a way that is suitable[2] for a flashcard, and that leads you to remember what you actually want, rather than some proxy information. Finally, you need to actually create the card. How long all of this takes depends on the particular text, but in my experience, it averages more than 90 seconds. Especially for cards drawn from complicated material, creating the card may take more time than all the reviews combined.
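The 90-second figure can be sanity-checked with a quick simulation of Anki's default schedule. The sketch below is my own simplification: it assumes a first interval of 1 day and the default ease factor of 2.5, with every answer marked "Good", and it ignores lapses and same-day learning steps (which would add a few more reviews, bringing the total closer to 90 seconds):

```python
def reviews_within(days, first_interval=1.0, ease=2.5):
    """Count reviews of a single card within `days`, assuming every
    answer is 'Good' (the interval is multiplied by the ease factor
    after each review)."""
    count, interval, elapsed = 0, first_interval, 0.0
    while elapsed + interval <= days:
        elapsed += interval
        count += 1
        interval *= ease
    return count

n = reviews_within(5 * 365)  # reviews falling in the next 5 years
print(n, n * 10)  # 8 reviews, 80 seconds at 10 s/review
```

Eight to nine reviews at 10 seconds each lands right around the 90-second estimate above.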

In general, there are three important things to consider:

  • How relevant is a deck of flashcards for what you actually want to learn or remember, versus some proxy objective;
  • What is the deck's quality - we may consider this to be how well the deck does the job it set out to do;
  • Finally, how much time did it take you to create or adapt the deck for your use case.

It is easy to create cards quickly if you are willing to sacrifice their quality. It is also easier to use public decks than to create your own, but they may be less relevant to your use case.

Consider a personal example: last year, I wanted to improve my French. I had two options: create my own deck, or adopt someone else's. After searching through the online Anki decks, I found this deck, which seemed to do the job. The deck's goal is specifically to strengthen your vocabulary, but my goal was to improve my French overall. The actions required to reach both goals are similar, but not identical. By using only this deck, I would miss out on important word-usage context that is hard to learn without looking at whole sentences, and I would largely miss out on French grammar.

Regarding its quality even for its main goal of learning vocabulary, the deck is very good, but not perfect. The cards don't come with enough context to distinguish synonyms. So if a card asks for the French word for "year", I can't tell whether it wants "an" or "année". After importing this deck into Anki, I removed some parts that made it bulky and hard to use on AnkiDroid. Moreover, as I reviewed the cards and ran into synonyms, I would edit the card to add a hint about which synonym was intended. Overall, I spent about 1 second per card, on average, on editing (since most cards didn't require any editing by hand).

Guesstimating some quantitative score for the relevance, quality, and time aspects of the cards, I would say this French deck scores 0.95 for vocabulary and 0.8 for relevance to my case, and I spent 1 second per card on it. If I had created these cards myself, I would probably have produced something scoring 0.8 for vocabulary but 0.95 for relevance, while spending orders of magnitude more time creating them. This trade-off is simply not worth it, and it is generally preferable to use public decks as a starting point, if they are available.

What if there are no available public decks?

As I show in this post, current LLMs, such as GPT-4, are good enough that you can delegate most of the card creation process to them.

When trying to study a novel article, I could create my own cards, spending 90 seconds per card on cards that were 0.8 good and 0.9 relevant for what I wanted, or I could automate the card creation process and spend 1 second per card on cards that were 0.9 good and 0.8 relevant. The same benefits of using public decks apply here.

Note that people tend to overestimate their skill at creating good cards. Especially when creating cards en masse by hand, there is a strong tendency to simplify by focusing on the information that is easy to put into flashcards, rather than the information you actually care about but which is hard to encode into a card. So you should probably lower your expectations of how good your cards actually are (unless you have considerable experience to back up your estimate).

How to create cards automatically

Prompt

After some testing, I've converged on using variants of this template prompt for GPT-3.5 or GPT-4:

Create [number of cards] concise, simple, straightforward and distinct Anki cards to study the following article, each with a front and back. Avoid repeating the content in the front on the back of the card. In particular, if the front is a question and the back an answer, avoid repeating the phrasing of the question as the initial part of the answer. Avoid explicitly referring to the author or the article in the cards, and instead treat the article as factual and independent of the author. Use the following format:

1. Front: [front section of card 1]
Back: [back section of card 1]

2. Front: [front section of card 2]
Back: [back section of card 2]

... and so on.

Here is the article:

"[Article]"
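If you script this process rather than pasting into the browser, the template can be filled in programmatically. A minimal sketch, with the `TEMPLATE` string abbreviated from the full prompt above:

```python
# Abbreviated stand-in for the full template prompt given above.
TEMPLATE = (
    "Create {n} concise, simple, straightforward and distinct Anki cards "
    "to study the following article, each with a front and back. "
    "Use the following format:\n\n"
    "1. Front: [front section of card 1]\n"
    "Back: [back section of card 1]\n\n"
    "... and so on.\n\n"
    "Here is the article:\n\n\"{article}\""
)

def build_prompt(article: str, n_cards: int) -> str:
    """Fill the template with the article text and desired card count."""
    return TEMPLATE.format(n=n_cards, article=article)

prompt = build_prompt("Spaced repetition improves retention.", 3)
```

The resulting string can then be pasted into the chat interface, or sent through an API if you have access.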

In my experience with the Rationality deck (see below), if the word count of the article is , you can aim for  cards. In general, you should avoid (if you can) breaking the article into smaller parts, each processed independently, as the LLM will lose important context to generate the cards. The template prompt contains pieces of information that are relevant for the LLM to know:

  • "concise, simple, straightforward": otherwise, GPT-3.5/4 has some tendency to add a lot of text to the back of the card, which goes against some flashcard design principles.
  • "distinct": mainly to avoid it creating cards covering the same information.
  • "Avoid repeating the content in the front on the back of the card. In particular, if the front is a question and the back an answer, avoid repeating the phrasing of the question as the initial part of the answer.": this is a very common issue otherwise; GPT-3.5/4 defaults to adding unnecessary information to the back of the card.
  • "Avoid explicitly referring to the author or the article in the cards, and instead treat the article as factual and independent of the author.": by default, it tends to not treat the information in the article at face value. It would regularly add "According to the article/author" to the front or back of the cards, which is unnecessary.
  • "each with a front and back" and "Use the following format": By keeping the format consistent among cards, it is easy to post-process them with a script to add to Anki. GPT-4 always follows the template, while GPT-3.5 only does sometimes. It is possible that adding an example card to the end of the prompt would help, but I haven't checked.

See the images below for examples of some of these issues.

Minor caveats

If the source text is too long, cutting it into smaller parts to be processed independently may lead the LLM to lose important context. In practice, this may result in similar flashcards being created in different iterations. More rarely, the lack of context may lead to wrong information on some of the cards.

Without additional context, flashcards may be created for content that is not interesting to you personally. I suggest asking for more cards than you actually want, and then deleting the irrelevant ones as you review them.

GPT-4 (and even more so GPT-3.5) has a tendency to fall into repeating patterns when creating the cards. For instance, if the first card it creates has a one-word answer, it is more likely to formulate subsequent cards to have simple answers as well. If the first card repeats the phrasing of the question at the start of the answer, subsequent cards are more likely to do so. This may be avoidable with a better prompt, but I haven't managed to solve it so far.

Examples

For these examples, I mainly used GPT-4 through the browser interface. See the next section for a comparison with GPT-3.5.

Rationality: from AI to Zombies deck

As a proof of concept, I applied this technique to the whole "Rationality: From AI to Zombies" sequences, resulting in almost 4800 cards. I applied it almost entirely algorithmically. If the entry contained  words, it aimed to create  cards. If the word count surpassed 2000, the LLM would generally reach the limit of its context window. In that case, I would break the article by hand into smaller chunks that could be processed independently. This should be possible to do automatically using the GPT-4 API, but since I don't have access to it, I've done these steps by hand through the ChatGPT online interface (see example here).
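The by-hand chunking step could itself be scripted. A rough sketch, under my own simplifying choices (split only at paragraph boundaries, with a 2000-word budget per chunk matching the limit mentioned above; a single paragraph longer than the budget is kept whole):

```python
def chunk_article(text: str, max_words: int = 2000) -> list[str]:
    """Split an article into chunks of at most `max_words` words,
    breaking only at paragraph boundaries (blank-line separated)."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Start a new chunk if this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be fed to the template prompt independently, at the cost of the context issues described in the caveats above.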

I asked GPT-4 to write a Python script to then take these cards (saved into a text file) and restructure them to be easily imported into Anki.
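A minimal version of such a script might look like this (a sketch in the same spirit, not the exact script GPT-4 produced for me): parse the numbered Front/Back pairs from the saved text and emit tab-separated lines, which Anki's text importer accepts.

```python
import re

# Matches "1. Front: ... Back: ..." blocks in the template format,
# stopping at the next numbered card or the end of the text.
CARD_RE = re.compile(
    r"\d+\.\s*Front:\s*(.*?)\s*Back:\s*(.*?)(?=\n\s*\d+\.\s*Front:|\Z)",
    re.DOTALL,
)

def parse_cards(text: str) -> list[tuple[str, str]]:
    """Extract (front, back) pairs from GPT output in the template format."""
    return [(f.strip(), b.strip()) for f, b in CARD_RE.findall(text)]

def to_anki_tsv(cards: list[tuple[str, str]]) -> str:
    """One card per line, front and back separated by a tab, which
    Anki's 'Import file' dialog understands out of the box."""
    return "\n".join(f"{front}\t{back}" for front, back in cards)
```

Extra fields, like the Subject and Source mentioned below, can be appended as additional tab-separated columns.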

2 cards created by GPT-4.

The full deck, containing 4769 cards, can be found here. I have gone through the Book I subdeck so far, and I can certainly say I would not have done a better job myself by hand. The Anki cards in this deck contain additional fields besides the Front and Back: the Subject, in bold, corresponding to the post's title, and the Source, a link to the post (hidden when reviewing the card). These were filled in with a Python script (written by GPT-4), based on the post each card came from.

Note that it would have taken me months to create the full deck by hand. If each card indeed took 90 seconds to create, it would have required 120 hours of work. Instead, it only required me to input a few hundred prompts into the browser interface. This process could have been automated entirely with access to GPT-4's API.

Due to this automated approach, some cards are silly, like those associated with a story. But these can be simply deleted from the deck afterward, based on personal preferences.

Unless you wish to memorize the fable, most of the cards associated with this story can be skipped.

Scientific paper

It is especially hard to create decent cards to study scientific papers, as it is easy to fall prey to card creation patterns that lead to cards that are not very useful in the long term[2]. Here, I attempted to automatically create study cards for the paper Quantum spectral methods for differential equations. Since I've personally studied this paper extensively, I could judge the quality of the resulting cards. For this test, I gave the paper to GPT-4 in its LaTeX format (I could alternatively have used ar5iv.org, as the browsing variant of GPT-4 does).

Overall, GPT-4 is quite good at parsing the text-heavy sections of the paper, such as the abstract, introduction, and conclusion, and summarizing them into well-structured cards.

These two cards, from the introduction section, are not too different from what I would have personally written by hand.

Unfortunately, it is more likely to fall into undesirable patterns when parsing the math-heavy sections.

Correct information, but poor wording. The answer doesn't flow from the question.
Without the full paper's context, it is not possible to know what Lemma 19 is.

For these sections, I obtained somewhat better results by adding "Focus on the mathematical aspects of the article, including its equations, and feel free to use LaTeX to write them." to the template prompt.

Without the prompt suggestion, GPT-4 tends to avoid using math expressions in the cards.

The resulting cards may require additional post-processing, if they contain LaTeX code.

\rangez is a non-standard command defined in the LaTeX source of the paper. Anki doesn't know what it is (and probably neither does GPT-4), so it needs to be rewritten using standard commands.
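Part of this rewriting can be automated for simple cases: parameterless \newcommand definitions can be scraped from the paper's preamble and expanded in the card text. A limited sketch (it deliberately handles only argument-free definitions with brace-free bodies, which is enough for commands like \rangez):

```python
import re

def extract_macros(preamble: str) -> dict[str, str]:
    """Collect parameterless \\newcommand definitions whose bodies
    contain no nested braces."""
    return dict(re.findall(r"\\newcommand\{\\(\w+)\}\{([^{}]*)\}", preamble))

def expand_macros(text: str, macros: dict[str, str]) -> str:
    """Replace each \\name occurrence in `text` with its definition."""
    for name, body in macros.items():
        # A lambda replacement avoids re.sub mangling backslashes in the body.
        text = re.sub(r"\\" + re.escape(name) + r"\b",
                      lambda m, b=body: b, text)
    return text
```

Cards cleaned this way still deserve a manual pass, since nested or parameterized macros fall outside this sketch.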

Overall, for hard subjects such as scientific papers, I would suggest using LLMs as a starting point to create cards for your study, while being careful to fill in any gaps left by them, especially in the harder sections requiring deeper context.

Comparison between GPT-3.5 and GPT-4

Based on my experience so far, I would not recommend using GPT-3.5 for harder texts, at least not without improving on the template prompt given above. The problems described here also occur with GPT-4, but so rarely that they don't really constitute an issue.

GPT-3.5 may not stick to a consistent template when creating the flashcards, forcing you to post-process them by hand in order to import them into Anki. It also falls more easily into the undesirable patterns described in the previous sections.

GPT-3.5 (top) and GPT-4 (bottom). GPT-3.5 often pointlessly repeats the wording of the question in the answer, despite being asked not to do so in the prompt, and ends up including less relevant context, compared to GPT-4.

GPT-3.5 is also more likely to create cards that make little sense without the original context.

A not very useful card, by GPT-3.5. Its information only makes sense in the full context of the original post.

Conclusion

Surprisingly few people use Spaced Repetition techniques to improve their learning, despite extensive research advocating for their use. If my experience is anything to go by, it may be due to how time-consuming creating the flashcards turns out to be, especially when you want to remember something that is not so easy to encode into cards.

Hopefully, by automating the card creation step with LLMs, you can now focus on the card reviews themselves, making it easier to use Spaced Repetition techniques to learn the novel subjects you encounter.

If you have any feedback, feel free to share it below in the comments!

  1. ^

    See here for other efficient learning techniques.

  2. ^

    Useful advice for creating cards can be found here and here.

3 comments

I tried doing this. It was okay. Here was the prompt I used. 

 

I also wrote code to use the GPT-4 API, feel free to DM me if you want it. 

 

It was also very expensive in the end. 

I spent a lot of time on it and it wasn't that good, I think I'll wait for the next GPT to try it again. 

"""
# Better, and now it talks about increases instead of just effects

I'm an Economics student and I want to create flashcards that comprehensively cover the knowledge and the reasons why for each of those pieces of knowledge
------
Make flashcards that cover every piece of information in the text.

One piece of information on the back of each card, eg.

EXAMPLE 1

Incorrect: 
Many labor substitutes cause what effect on labor demand?;elastic, high skilled jobs more inelastic

Correct: 
Many labor substitutes cause what effect on labor demand?;elastic
High skilled jobs have what type of labor demand?;more inelastic

Format:

question;answer
question;answer

NO "card 1:" or "card 2:" 
only 
question;answer

 


make concise flashcards that
- only have one piece of information in the answer
- very clear and concise in wording
- when you ask for definitions, say "define x"

In the question field
- Never ask what effect [factor] has on [thing]. Always ask what effect a [change in factor] has on [thing]
- Never ask how does [factor] affect [thing]. Always how does AN INCREASE IN [factor] affect thing, or always ask how a DECREASE IN [factor] affects thing
if the question field is: "how does x affect y", change it to "how does an (increase in / decrease in) x affect y" or "cheaper x" or "more expensive x". 
Do NOT do "how does x affect y"

EXAMPLES

Example 1

Incorrect: How does snow affect slipperiness?; An increase in snow increases slipperiness

Correct: How does an increase in snow affect slipperiness?; Increases slipperiness

Example 2
Incorrect: How does wind affect beach revenue?; More wind leads to less wind revenue

Correct: How does more wind affect beach revenue?; decreases beach revenue

Example 3
Incorrect: How does price of metal affect construction?; cheaper metal increases construction

Correct: How does cheaper metal affect construction?; increases construction

 

 

 

- if you make a card on what effect x has on y, make another card asking WHY that effect is true
- if you make a card that states a relationship between variables, make another card on WHY that relationship is true
- split the effect and the CAUSE of that effect into two different cards
- do NOT combine the effect, and the explanation of that effect, into the same card
- do not make up any details, use only details that are from the text
- IMMEDIATELY after each card, make another card that asks why it's true. Do not miss any cards, and do not make anything up
- immediately after EVERY card, every single card, say why. Don't miss any cards. after you've made a normal card.
- Always make a card that asks WHY every card is true. Never skip making a WHY card.

 

- the maximum amount of words in the answer field is ten. be very concise. don't be wordy. Be concise
"""

Cool, I haven't been able to play with the API yet.

Yeah, it has its challenges. Personally, my template prompt gets the card format right almost all of the time with GPT-4 (only sometimes with GPT-3.5). I asked it to return the cards using "Front:" and "Back:" because it would often default to that format (or "Q:" and "A:"), and it's easy to clean it up with a script afterward.

As you've seen yourself, it's very difficult to force it to be concise. It does tend to ignore the more specific commands, as you tried. I suggest putting the examples at the end of the prompt instead of the middle. I've noticed that the later cards tend to inherit the formatting of the first ones, so examples at the end might play that role.

Personally, I'm quite happy with how the Rationality deck turned out, but I'm hesitant about using it for more complicated topics (in my case, mostly math-heavy stuff).

In any case, I would likely not have spent the time writing the cards by hand, so this is always better than nothing.

Related: Wisdolia is a Chrome extension which automatically generates Anki flashcards based on the content of a webpage you're on.