Where do we begin to improve human thinking? Among the many learning theories, Bloom's Taxonomy[1] is a well-cited approach that categorizes learning into six hierarchical stages, progressing from simple to complex and from concrete to abstract: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. As such, Bloom's Taxonomy likely facilitates in-depth learning of a concept, which we define as an abstract, modular form of broader knowledge.

Now, if we want our LLMs to fully grasp a concept about the real world (i.e., we want generalization over memorization), perhaps we could borrow a framework like Bloom's Taxonomy to teach these concepts. And since we are now dealing with a hierarchy of knowledge, we might also need some sort of teaching curriculum (i.e., curriculum learning). This post is intended to be a more talkative and conceptual walkthrough of our recent preprint, "Instruction Tuning with Human Curriculum."


Big Point 1. Unleashing the Base LLM Knowledge Through Human-Like Learning

Small Point 1.1. The Power of Progressive Learning

LLMs have come a long way, particularly in understanding and responding to human instructions. Yet, the way we train these models to follow instructions remains a bit like throwing them into a sea of information and hoping they learn how to swim. While this method has produced some incredible LLM systems, it’s not the most efficient approach. 

Humans have a structured educational journey. We start with fundamental subjects like basic arithmetic and slowly work our way up to calculus and quantum physics. Our learning is organized, gradual, and built on previously acquired knowledge. This is what education scientists (and, in machine learning, Yoshua Bengio) refer to as "curriculum learning." So, what if we trained LLMs with a literal school curriculum?

Imagine a scenario where our model, which we'll fondly call 'Corgi,' enrolls in a digital high school, college, and graduate school. Corgi starts with basic math, literature, and sciences. As it progresses through its high school years, the complexity of the subjects it studies increases, along with its reasoning abilities. By the time Corgi 'graduates,' it is a robust, versatile model capable of solving a wide range of problems with deep understanding.

Small Point 1.2. Learning the ABCs First

The core principle behind this idea is simple yet powerful: learning from simple to complex. In our experiments, we compared Corgi's performance when subjected to random learning versus structured, human-like learning. The results were more promising than we expected. When Corgi learned using a curriculum inspired by human education, it significantly outperformed other methods, especially on tasks requiring world knowledge and commonsense reasoning.

To achieve this, we created a dataset that mimics a human educational journey. We used frameworks from international secondary education curricula and various university catalogs. The dataset covers topics that a student would typically learn in high school and university. It also contains questions of varying complexities, designed to test and build upon the LLM's understanding of a subject.


Big Point 2. How "Corgi" Mimics the Human Learning Process to Educate Machines

Small Point 2.1. The "Corgi" Framework: What it is and How it Works

Imagine a virtual teacher and a student: the teacher meticulously follows an entire educational curriculum, ensuring the student learns progressively and retains information. That's essentially what the "Corgi" framework is about, except the student here is our target LLM, and the teacher is ChatGPT.

Corgi aims to make instruction tuning of LLMs resemble the human educational experience by structuring the learning material around actual educational curricula (from UPenn, Penn State, and Cambridge IGCSE) and by implementing teaching strategies inspired by education science.

The first step is gathering a dataset that resembles a well-rounded curriculum. We sidestepped the challenge of creating such a large and diverse dataset from scratch: instead, we took existing educational curricula and used teacher models like ChatGPT to automatically generate synthetic data covering a broad range of topics, from mathematics to philosophy. These generated data points are then refined and deduplicated, creating a dense map of human knowledge (subject to errors).
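To make this pipeline concrete, here is a minimal sketch under our own simplifying assumptions: `ask_teacher` is a hypothetical stand-in for a call to a teacher model such as ChatGPT, and the prompts and the deduplication step are illustrative rather than the exact ones used in the paper.

```python
# Minimal sketch of the curriculum-driven data generation loop (illustrative only).

def ask_teacher(prompt: str) -> str:
    """Placeholder: send `prompt` to a teacher LLM and return its reply."""
    raise NotImplementedError

def generate_examples(subject: str, topic: str, n: int = 3) -> list[dict]:
    """Generate instruction-response pairs for one curriculum topic."""
    examples = []
    for _ in range(n):
        question = ask_teacher(
            f"Write one exam-style question on '{topic}' ({subject}) "
            "for a student seeing this topic for the first time."
        )
        answer = ask_teacher(f"Answer the question step by step:\n{question}")
        examples.append({"subject": subject, "topic": topic,
                         "instruction": question, "response": answer})
    return examples

def deduplicate(examples: list[dict]) -> list[dict]:
    """Drop exact-duplicate instructions (a crude stand-in for real dedup)."""
    seen, kept = set(), []
    for ex in examples:
        key = ex["instruction"].strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```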

Small Point 2.2. The Role of Bloom's Taxonomy

As mentioned above, Bloom's Taxonomy is a hierarchical model that outlines different levels of human understanding, ranging from basic "remembering" to complex tasks like "evaluating" and "creating." Corgi incorporates this taxonomy when generating instructions for its training data. This ensures that the machine doesn't just learn facts but also understands, applies, and potentially creates based on them, just like human students do.
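As a hedged illustration of what taxonomy-aware instruction generation can look like, the sketch below maps each Bloom level to a prompt template; the templates are invented for this post, not taken from the paper's actual generation prompts.

```python
# Illustrative only: one prompt template per Bloom level, ordered simple to complex.
BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]

TEMPLATES = {
    "remembering":   "State the definition of {concept}.",
    "understanding": "Explain {concept} in your own words, with one example.",
    "applying":      "Use {concept} to solve the following problem: {problem}",
    "analyzing":     "Break {concept} into its components and explain how they relate.",
    "evaluating":    "Assess the strengths and weaknesses of {concept} for {problem}.",
    "creating":      "Design a new problem that requires {concept}, then solve it.",
}

def instructions_for(concept: str, problem: str = "a relevant scenario") -> list[str]:
    """One instruction per taxonomy level, from simple recall up to creation."""
    return [TEMPLATES[level].format(concept=concept, problem=problem)
            for level in BLOOM_LEVELS]
```

For example, instructions_for('photosynthesis') returns six prompts about the same concept, starting with simple recall and ending with an open-ended creation task.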

Small Point 2.3. Curriculum Design: Blocking vs. Interleaving

In human learning, it's not just what you learn but how you learn it that counts. In traditional "blocking," you study one subject exhaustively before moving on to the next. But psychological research shows that "interleaving," or mixing up different topics and revisiting them, leads to better retention and understanding. Likewise, our interleaving curriculum does not stick to one subject until it is exhausted; it cycles through different subjects while also progressing from simpler to more complex concepts.
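The two orderings can be summarized in a few lines. This is a sketch under the assumption that each training example carries a `subject` and a numeric `difficulty` (for instance, its Bloom level index); the field names are ours, not the paper's.

```python
# Contrast "blocking" with "interleaving" over the same pool of examples.
from itertools import zip_longest

def blocking(examples: list[dict]) -> list[dict]:
    """Finish one subject entirely (easy to hard) before starting the next."""
    return sorted(examples, key=lambda ex: (ex["subject"], ex["difficulty"]))

def interleaving(examples: list[dict]) -> list[dict]:
    """Cycle across subjects each round while difficulty rises within every subject."""
    by_subject: dict[str, list[dict]] = {}
    for ex in sorted(examples, key=lambda ex: ex["difficulty"]):
        by_subject.setdefault(ex["subject"], []).append(ex)
    rounds = zip_longest(*by_subject.values())  # one example per subject per round
    return [ex for round_ in rounds for ex in round_ if ex is not None]
```

Interleaving here is a simple round-robin over subjects, so early training already touches every subject at its easiest level instead of saving some subjects for the very end.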


Big Point 3. Curriculum Works for Instruction Tuning

Small Point 3.1. Diverse Benchmark Results I (Corgi vs Vicuna vs WizardLM)

We're using the LLaMA 2 13B model as a starting point, and then we're instruction tuning it. We are comparing Corgi against two other models: Vicuna v1.5 and WizardLM v1.2. Both of these models have also been fine-tuned from LLaMA 2, but their data is collected in different ways: Vicuna is trained on real-world user conversations (ShareGPT), while WizardLM synthesizes its questions by automatically evolving them into more complex forms (Evol-Instruct).

In our experiments, Corgi outperformed the other models, and it did this even when trained on a smaller dataset. One of the significant findings was that the sequence in which you present learning material during this fine-tuning process matters a lot.

We noticed that if you "interleave" subjects and concepts while training (mixing them up but still keeping a sense of increasing difficulty), you get much better performance than if you simply "stack" them one on top of the other. These improvements were clearly seen across various benchmarks.

Why Does This Matter?

Well, there are a couple of reasons why our approach seems to work better. One reason is that when you have limited training time, a structured approach like interleaving helps the model learn better and, more importantly, faster. This is particularly useful because we don't want to over-train and risk losing the model's ability to generalize to new situations. Another reason is that our approach seems to be more robust against "noisy" data, meaning it can still learn effectively even when the data isn't perfect. Such benefits of curricula were also discussed in prior literature[2].

Small Point 3.2. Diverse Benchmark Results II (How You Sort Training Data Has a Significant Impact)

The more you think about it, training an LLM to handle multiple areas of knowledge is a non-trivial problem: you have to consider how to introduce these topics to the model. We see two broad branches, depending mainly on how the difficulty of the instruction data progresses (a short code sketch of the local orderings follows the list below).

  • Global Curriculum: This is like going to a well-rounded school where you learn a bit of math, science, and literature every week. It aims to balance different types of challenges and subjects over time.
    • Interleaving: This globally balances cognitive challenges based on a well-known educational framework (Bloom's Taxonomy).
  • Local Curriculum: This is akin to immersion courses where you deeply learn one subject before moving to the next. It focuses on mastering one topic before introducing a new one.
    • Blocking: This locally focuses on one subject, making sure you get it before moving to the next.
    • Clustering: Similar to Blocking, but does not care about the order within a subject.
    • Spiral: Revisits old subjects but with new twists and challenges.
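Continuing the sketch from Small Point 2.3 (where blocking and interleaving were shown), here is one plausible reading of the two remaining local orderings. As before, the `subject` and `difficulty` fields are our own illustrative assumption, and the precise definitions in the paper may differ.

```python
# Illustrative "clustering" and "spiral" orderings over the same example pool.
import random

def clustering(examples: list[dict], seed: int = 0) -> list[dict]:
    """Group by subject like blocking, but ignore difficulty order within a subject."""
    rng = random.Random(seed)
    by_subject: dict[str, list[dict]] = {}
    for ex in examples:
        by_subject.setdefault(ex["subject"], []).append(ex)
    ordered = []
    for subject_examples in by_subject.values():
        rng.shuffle(subject_examples)   # within-subject order is not controlled
        ordered.extend(subject_examples)
    return ordered

def spiral(examples: list[dict]) -> list[dict]:
    """Revisit every subject once per difficulty level, getting harder each pass."""
    levels = sorted({ex["difficulty"] for ex in examples})
    ordered = []
    for level in levels:                # one pass over all subjects per level
        ordered.extend(ex for ex in examples if ex["difficulty"] == level)
    return ordered
```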

What We Found: Not All Curricula are Created Equal

  1. Global beats Local: Our experiments showed that the global approach was more stable and effective in general learning. Local methods can sometimes lead the model astray, making it harder for the model to generalize its learning.
  2. Order Matters: Surprisingly, the sequence in which data is presented can significantly impact how well the model performs. Poorly structured data could actually be worse than just randomly shuffling the training set.
  3. Beyond the Target Domain: Interestingly, a good training strategy doesn't just improve performance on the intended tasks (which was MMLU for us). It can also enhance the model's ability to reason, making it more versatile. This may also stem from our use of Bloom's Taxonomy, which explicitly prompts the base model to learn how a teacher model reasons about the same concept.

Data Exemplars and Conclusion

In summary, our research introduces Corgi, a new approach to training large language models that draws inspiration from human educational methods. Think of it as teaching a language model like you would a high school student—starting from the basics and progressing to complex topics in an organized fashion. Our results show that this curriculum-based method outperforms traditional, random training approaches, offering better results in tests of reasoning and general knowledge. This underscores the importance of not just having a lot of data, but organizing it in a meaningful way, much like a well-planned syllabus in education. While our study is promising, more work needs to be done. For example, how do we effectively gauge the "difficulty" of a task for a machine compared to a human? And as these models get even larger, will the curriculum approach still offer the same advantages? These questions pave the way for exciting future research.

  1. Benjamin S. Bloom, Max D. Engelhart, Edward J. Furst, Walker H. Hill, and David R. Krathwohl. 1956. Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain. McKay, New York.

  2. Xiaoxia Wu, Ethan Dyer, and Behnam Neyshabur. 2020. When do curricula work? In International Conference on Learning Representations.
