Notes on Algorithms

Menotim

This post records what I've learned while studying some of Introduction to Algorithms, 3rd and/or 4th edition,^[1] by Cormen, Leiserson, Rivest and Stein. Unfortunately, I didn't have time to study the chapters on graphs or the ones on miscellaneous topics before my arbitrary self-imposed deadline; maybe I'll read those some other time and write another post about them.

The post has attempts at explanations/notes on most of what I've read, though I mixed up the order some. I also implemented a few of the data structures and solutions to dynamic programming problems, and put the implementations on GitHub.

Complexity Analysis

The time complexity of algorithms is measured asymptotically, meaning up to a multiplier constant and for sufficiently big. I had already seen the terms for asymptotic complexity, but I kept confusing them, because there were so many (, , , , and ).

So here's what they mean: means asymptotically smaller than , means asymptotically bigger, means asymptotically equal, and and are strict versions of and , meaning functions that grow much slower or much quicker than the argument.

This means, for example, that we should say our algorithm runs in time, if we know our algorithm runs in exactly quadratic time; we say a problem is solvable in time, if the algorithm is quadratic time but we still hope to find a algorithm; and we say the problem needs time once we have proved that there is no way to solve it in linear time.

Here are the formal definitions:

if there are constants , such that for , .
if for all constants , there is a constant such that for , .
if , and if .
if and .

For example, polynomials of degree are . All logarithms differ by a constant only, so we can write without specifying the logarithm's base. Powers of logarithms grow slower than any polynomial: for all and , . This also means, for instance, that and , is between and .

By default, when analyzing an algorithm we try to find the complexity class of its runtime, as . If the runtime of an algorithm varies with input or the phase of the moon, we usually look for the complexity class of its expected runtime according to some distribution, or for that of its worst-case runtime over all conceivable inputs.

Recurrence Equations

When we write a recursive algorithm, usually the running time comes out as a recurrence equation that we have to solve. For example, in the Fast Fourier Transform algorithm we do a bunch of operations with our vector of length , that take a time , and also split it into two vectors of length and run the FFT on both of them. So if the time to run the FFT on a vector of length is , it satisfies this equation:

In my previous post I stated that this algorithm ran in time, with no proof. Our book gives two ways to prove it: the first is to prove it's and separately, both by induction: for example, in the case you'd use the inductive hypothesis to assume , and substitute above to write that , and work that until you find . The much easier way, when it works, is to use the master theorem.

The master theorem works when the recurrence equation is of the form:

for constants and . Each time we apply this equation, we split the work into equal parts of cost . The individual parts have cost when you've split times. At that point, you'll have parts. This is called the watershed function. The master theorem says this:

If grows polynomially slower than ( for ), then ;
If grows like , or faster than it by only (), then . So if and the watershed function grow at the same rate, then .
If grows polynomially faster than ( for ), and satisfies the regularity condition that for some , , then .

So, essentially:

When dominates by enough, the additional s don't matter, and the bottom layer of the recursion tree dominates the cost. The cost is .
When and are roughly comparable (equal or is a bit bigger), then all the layers of recursion are relevant, and you have to multiply the cost per layer (which is in this case), by the amount of layers .
When dominates , the recursion basically makes no difference, and only the top layer matters, so the cost is .

Amortized Analysis

A data structure comes with a set of operations that can be run on it. Sometimes the runtimes of these operations vary a lot based on the state of the data structure. We could try to find their expected or worst-case runtime, but usually it's more appropriate to find the amortized runtime, which is the worst case average runtime of that operation over all possible sequences of operations.

For example, suppose we have a stack with elements, which we can push one element into in a time or pop elements at once out of in a time (or if ). The pop operation runs in time, and could go up to , so popping is in the worst case.

However, its amortized time is much lower: we need to push an element at a time onto the stack, so if we take any sequence of operations, the largest amount of elements that could be popped is (by pushing one every operation then popping them all back out), so the worst-case average runtime is .

There are easier ways to do amortized analysis than to search through all possible sequences of operations to find the worst one. One way is the accounting method, where you change the cost of each operation so that some of them cost extra, which is stored away as credit, and some of them use up that credit by costing less.

In our stack example, we could say that pushing costs 2 time units instead of 1, pre-paying for when the pushed element will be popped out; and popping any amount of items costs 0 time, because we're using up the credit built up by the pushes. Since 2 and 0 are obviously both , both operations cost amortized time.

For this method to work you need to make sure not to borrow more than you stored up: if each operation has a real cost and we assigned it a virtual amortized cost , we need for all possible sequences. Does this help any? We still need to search through all possible sequences.

The book says we can do it by considering each operation to store credit into elements of the data structure or take it out from them, so in our example each push would "store" an unit of time in its pushed element, which would be "used up" by its subsequent pop. So we can guarantee not to use up more than we stored because each pop only uses what was stored by a push.

The book doesn't write the argument more rigorously than that, and to me it feels like doing so would just take us back to our previous argument. The accounting method might not that much with proofs, though it is very useful as a conceptual tool. However, there is a special configuration of the accounting method that does make proofs much easier: the potential method.

This method defines a potential function , that returns how much credit we have built up so far from the state of our data structure. So if operation turns data structure into data structure , the amortized cost is . The clever part is that if we add up the amortized costs, the sum telescopes:

So if we define the potential of our data structure's initial state as zero, , then all we need is for the potential to be greater than or equal to zero, for all states our data structure could reach, , and our amortization works. In our stack example, we could define the potential as the number of elements in the stack. This is obviously never negative, and leads to the same costs of pushing = 2, popping = 0 as before.

Sorting

A sorting algorithm takes a list of objects that have a defined ordering (for example, numbers) and returns a permutation of that list that is ordered (meaning one where smaller objects come before greater ones).

The most common sorting algorithms make no assumptions beyond these, and so can't do anything to their objects except compare them. These are comparison sorts. There are comparison sorts that work in time for a list of length , which is the best possible time. We can prove comparison sorts must take time like so:

Our algorithm must work for the sequences that have no equal elements. Suppose we are given one of those.
This means each comparison has only two possible results.
So if we make comparisons, there are possible sequences of outcomes, and since our algorithm is deterministic we will have at most possible return values.
There are possible orderings, each with a different correct return value. We must be able to return each of those, so we need
Taking the log of both sides, and using Stirling's approximation, we have

If each comparison takes constant time, this means any permutation sort runs in .

Comparison Sorts

The book describes four different comparison sorts:^[2]

Insertion sort works by induction. At the start, the first 1 element of the array is sorted. At each step, the first elements of the array are sorted, so we take element , and move it backwards, swapping it with each of the ones before it, until the one before it is less than it. After that the first elements are sorted. We repeat for every from 1 to . The runtime is . This is a terrible algorithm.

Merge sort works by divide and conquer. First, if our list has more than 1 element we divide it in two equal parts. Then, we run merge sort on each part. Then we merge the two: we make a new list, make a pointer into the start of each part, append the least of the pointed-to values, increment that pointer, and repeat. Since the two individual parts start sorted, at the end the new list will be completely sorted. Merging takes , so the runtime follows the recurrence equation , which we already know has the solution .

Heap sort works by using the max heap data structure. A max heap is a kind of priority queue: you can turn a list into a max heap in time, and extract the biggest number in it in time. So you just make a max heap and then extract all the values in order, which takes time. I will explain how max heaps work in the data structures sections.

Finally, quick sort is kind of like a reverse merge sort: instead of sorting two halves of the array individually and then shuffling their numbers in order into one array, we first rearrange the array into two halves where all the numbers in the first half are lower than all the numbers in the second half, and then we sort each of the halves individually.

This rearrangement is called a partition. We do it by picking one element in the array called a pivot, and putting all the elements smaller than the pivot to the left of it and all the larger ones to the right. The runtime of the algorithm depends on what we pick as the pivot: the closest it ends up being to the list's median value, the faster the execution.

If the element you pick as the pivot has the same chance of being in any position in the order, the expected runtime of quick sort is . Even if the pivot is always in the 99th percentile, the expected runtime is still . However, it can be worse than that.

For example, if you pick the pivot in the simplest possible way, by always picking the first element in the list, then in the common case where the list starts out already sorted no elements will be to the left of the pivot, and the runtime is . There are cleverer schemes, but an enemy could always pick the worst possible ordering to make the time .

The solution is to randomize the algorithm: at each point we pick the pivot at random, so no matter what the enemy does each element will in fact have the same chance of being picked! This guarantees our running time, if we're not exponential-in- levels of unlucky, and in practice usually runs faster than either merge sort or heap sort.

Other Sorts

If you have some kinds of additional information, you can sort a list in time.

For example, if you know your list will only contain integers in the range from 1 to , you can sort them in time with counting sort, called that because it counts how many there are of each number, uses that to count how many elements are less than or equal than each other, and then puts each of the numbers directly in its correct position in the output list.

As another example, if your probability distribution for the list's elements happens to be independent and uniform between 0 and 1, you can sort them in expect time with bucket sort: create buckets (linked lists), put the numbers in the interval into bucket (from 0 to ), and then sort each bucket with insertion sort. Since the buckets are so small those insertion sorts run in expected time.

If your distribution is not uniform between 0 and 1, but is still the same for each element, continuous, independent, and computable in time, you can still run bucket sort, by sorting each item according to the probability that another item will be less than or equal to it.

Data Structures

Sequence Data Structures

A data structure is a way to store a bunch of objects which lets you do certain operations to them efficiently. The simplest way to store objects is in a sequence, with which you can know how many there are find the first, last, and in general nth object, find the object that comes before or after a given object, and insert or delete new objects. There are two basic kinds of sequences, the array and the linked list.

An array is just all the objects in the sequence stored contiguously in memory. You can easily find the th object by taking a pointer to the start of the array and adding , where is the size of each object. (We zero-index objects, so the first element is element 0, fundamentally so that this formula can work.) This takes time.

The problem is inserting new objects. If you know you won't ever store more than objects, you can allocate that amount of memory for the array and gradually fill it up, but if you go past that limit it's over, you need to allocate an entire new array with at least slots and copy your old one, which takes time.

In a linked list, instead, you store each element in the sequence as its object plus a pointer to the next element. As long as you keep track of the first element, you can go through the list by following those pointers (the last element in the list will have a NULL next-element pointer). You might also give the elements a pointer to the previous element so that you can go backwards, creating a doubly linked list.

You can easily insert object at the start of the list in time by making a new list element holding that object, making its next-element pointer point to the previous first list member, making that element's past-element pointer point to the new element if you have those, and keeping the new element as the list's new first element. The problem with linked lists is that finding the th object takes time, much worse than the array's .

Linked lists also suffer from performance issues because since they're not contiguous, if you're using one it's harder for your computer to keep it in the cache, the smaller amount of faster memory storage where your computer automatically tries to keep frequently-accessed memory.

Luckily, there's a third sequence data structure, which solves all of these problems with the power of amortized analysis: it's contiguous and easy to both index and expand. This is the dynamic array.

To use a dynamic array we first initialize a regular array with some empty space for objects. We store a pointer to the array's start, the amount of space in the array (the dynamic array's capacity), and the amount of objects already in the array (the dynamic array's length).We find elements by index just like in a normal array, and as long as the capacity is less than the length, we can insert new elements by putting them after the previously last element and increasing the length.

The clever part comes when the array is full (length=capacity). Instead of allocating a new array with 1 extra slot of capacity, which would run every time we appended another object and take time to move the old array to the new one, we allocate a bigger array with lots of empty space, that has capacity equal to the previous capacity times a constant (usually 2x or 1.5x). This still takes , but it doesn't run very often, so the amortized cost is small.

Suppose our initial array has capacity 1, and every time our space runs out we allocate a new array with twice the capacity. Then we will allocate a new array on the 2nd insertion, the 4th, the 8th, and etc., with a total cost of for insertions. This is roughly twice the amount of insertions, so the amortized cost will be !^[3]

Hash Tables

A hash table stores a set of objects, each of which has a key. The main operation it supports, besides from inserting or deleting objects, is searching for the object with a certain key. You might expect we'd need time for a search, because that's the amount of bits we'd need to pin down the key among all keys, but actually we are able to bring down the time taken by a search to . How? By re-calculating exactly where the key was put and going there directly.

We do this by using a hash function. We will store our objects using an array with slots (hopefully ). The hash function takes a key and returns an index from 0 to . We store the key's object at that index, and search a key by calling the hash function on it again and going to that index. Hash functions are usually computable in , so both insertion and search are , we're done!

Well, not quite. Unless there are less than possible keys, our hash function cannot possibly be injective, so some keys will hash to the same index (these are called collisions). So instead of just storing one object at each index, we will instead store a linked list of all objects that hash to this index, and search through the list when we want to find a key. We will then try to ensure that the lists are small enough that searching through one of them still takes time.

We could try to find a clever statistical method that predicts the keys and puts keys that are likely to appear together in different spots. If we do that, however, an attacker could look at our scheme, find which keys collide, and input those, resulting in a single long list that takes to search. Also, it sounds hard. So we will instead take as our goal to find an unpredictable hash, one where the keys have the same chance of hashing to each slot and, more important, their distributions are independent.

If we manage that, what's the expected runtime? We'll define the load factor , which is the expected number of objects per slot. A search takes time to hash the key, plus the time needed to search for it. If the key is not actually in the hash table, we have to search through all the members of its list, which are in expectation, so the total expected time (with hashing) is . If it is on the table, there will be on average objects on its list; we need to search through half of them on average, plus the one we're looking for itself, so in expectation. This means the total expected time is in this case as well.

Since linked lists are not contiguous, which is inconvenient because of caching, we often use open addressing, where instead of making lists at each slot, we store one object in each, and if a slot is full we just find another empty slot to put the object in.

The simplest way is linear probing, where we start at the key's hash and move forward through the list (probing it) until we find somewhere empty, and put the object there. To search for an object we start at its key's hash, then move forward until we find either the object or an empty slot. For this to work we need plenty of empty slots, so should preferentially be significantly less than 1.

Other methods are quadratic probing, where the slots you probe increase quadratically (for example , , , etc.) and double hashing, where the slots you probe are calculated using another hash function (like , , , etc.).

Finally, let's look at some hash functions, assuming our input key is an integer. The most obvious way to map to one of slots is by taking the modulo (the division method): . This works reasonably well if is a prime and not close to a power of two, but it's predictable.

Alternatively, we could take the fractional part of , maybe after multiplying it by some constant , multiply that by , and take the floor (the multiplication method): . This is less predictable because we can vary the constant , and works for any , but it needs floating point operations.

A similar method that works with integers (and so is faster) is the multiply-shift method. Suppose our integers are bits long, and multiplying two of them truncates the resulting bit integer, taking only the first bits. We multiply our key by a -bit integer , and then take the highest bits, where is a power of two:

where is the left shift operator. This is pretty easy to compute. If you pick at random, it's also good at randomly spreading out the keys: this family of hash functions is -universal, meaning no two keys have a more than chance of hashing to the same slot. The book recommends you use this hash function. However, if at all possible you should use or copy someone else's implementation, because there exist much better and more secure hash functions.

Max Heaps

A binary tree is a collection of nodes. Each node has a parent, a left child and a right child. The descendants of a node's left child are its left subtree, and the descendants of its right child are its right subtree. The only node that doesn't have a parent is the root; any node might not have either of its children. Nodes can also contain an object with extra information.

The most obvious way to represent a tree is as an object with a pointer to its parent and pointers to its children, which are NULL when those children don't exist. We will represent binary trees that way for binary search trees and red-black trees, but for max heaps we will store a binary tree in an array.

In this array (0-indexed), the children of will be and , so the array is the result of reading the values of the tree in order from left to right, top to botttom. All the layers of the tree must be filled, with each node having both children, except for the last layer, which is filled from the left to a point. This means all the indices in the array from 0 to are occupied. It also means the height of the tree is , since each layer can fit nodes.

The objects we store in a max heap will each have a key like in a hash table. The basic operations we will need to support for a heap will be inserting objects and extracting the object with the largest key. To ensure we can do this efficiently, we will require the tree to follow the heap property, which is that each node has to be bigger than both its children.^[4] This means the biggest element in the heap is always the root.

As for how we use it, roughly: to insert an element in the heap, we append it to the end of the array and then let it "flow up" the tree, swapping it with its parents until its parent is bigger than it and the heap property is satisfied. When we extract the maximum element, we put the array's last element in its place, then let it "flow back down", swapping it with its biggest child until the heap property is satisfied. Since the tree's depth is , both of these take time.

This "flow down" process is also used when turning an array into a heap. We go from the bottom up, letting each element in the tree "flow down", which makes the subtree that was under that element follow the heap property (if both of its children's subtrees follow it, which they do because we ran the "flow down" on them first).

The flow down process takes because it swaps once per level of the tree, and we do it times, so you'd think this takes ; but there's a tighter bound, it actually takes because of how most of the nodes are in the last levels of the tree.

Binary Search Trees

A binary search tree is the use of a binary tree to store a list in sorted order. It's sometimes useful to store a list in sorted order, for example if you'll want to find all elements between a start value and an end value. However, if you store a sorted list into an array, inserting a new element takes time, because you have to push all the elements after it forward. Binary search trees are a more efficient alternative.

A binary search tree is a binary tree (stored in objects with pointers as described above), that require their nodes to follow the binary search tree property: the key to a node must be less than or equal to the keys of every node in its left subtree, and greater than or equal to the keys of every node in its right subtree.

This means that to find a key you compare it to the root, if it's less you go to its left child, if it's more you go to its right child, and repeat until you find the key or get to a NULL (in which case the key isn't in the tree): binary search, hence the name. This takes where is the height of the tree, the amount of nodes in the longest path from the root to a NULL.

To insert a new key into a BST, you do the same binary searching for it and when you get to a NULL (where it would be), you insert it in place of the NULL. Deleting nodes is more complicated, since you have to find new parents for its children, which is easy if it has only one (just attach it to its grandparent), but if it has two, you have to go down its right subtree, find the minimum key greater than it, and put that in its place. Both of these also take time.

Binary search tree operations take time proportional to the tree's height. We might hope the tree's layers will be mostly full and height grows logarithmically, . And if the nodes' order is random, this happens (in expectation)!

Unfortunately if, for example, your nodes keys are already sorted, such as 1, 2, 3, 4..., then 2 will be 1's right child, 3 will be 2's, 4 will be 3's and so on, and the height will be equal to . If we want logarithmic height, we can't just place the nodes wherever they fit; we have to rebalance the tree, breaking branches up when they get too long. This is where red-black trees come in.

Red-Black Trees

A red-black tree is a binary search tree where each node has a color, red or black.It is self-balancing: it guarantees that its height is always . It does this by requiring its coloring to follow these two properties (in addition to the binary search tree property):

A red node can only occur between two black nodes. This means the root can't be red, and a red node can only have black children. (NULLs are also considered black nodes.)
If you start at a node, and go down through its children until you reach a NULL, then for every path you take you will always pass through the same amount of black nodes. This is known as the original node's black height.^[5]

Since each branch of the tree has the same amount of black nodes, and can only have up to half as many red nodes (because you can't have two in a row), the longest branches are at most twice as long as the shortest. So it's easy to prove that the height of a tree of nodes is always .

To insert or delete a node from the tree, you need to do the same process as for a BST, and then readjust the colors and positionings to satisfy the red-black tree properties. This breaks down into a bunch of cases so I won't describe it fully here, but basically: you insert a new node as a red node, so it doesn't break property 2, but if its parent is red it does break property 1. If you delete a node, if it's red you're fine, but if it's black you might break property 1 and certainly will break property 2.

B-Trees

B-trees are a type of self-balancing tree that is used when you have very large trees and/or you need to store them in disk memory instead of RAM. Reading and writing to disk is expensive, so we want to minimize the amount of times we do that; however, we also assume RAM can only hold a small portion of the tree, so to go down a branch we need to read each node in sequence. This means we need the tree to be as shallow as possible, which we do by giving each node a large amount of children; if binary trees have a branching factor of 2, B-trees can have branching factors of over 1000.

Each node in a B-tree contains an amount of objects with keys, and children; analogously to binary search trees, if the th key is , the keys in the subree under the th child are all in the range . To search a B-tree, we see if a key is in the root; if not, we find its one child whose subtree must contain the key, and see if the key is there; if not, we look at its children, and so on.

The amount of keys per node is variable; the B-tree has a parameter named its minimum degree, and each node except for the root must have at least and at most keys. For example if , nodes can have 1 to 3 keys, meaning 2, 3, or 4 children; that structure is appropriately called a 2-3-4 tree. To insert an object into a B-tree, we find a leaf node with keys above and below its key (where the search for it ended) and insert the object as a new key in that node.

What if the node is full, with all keys? Then, before inserting, we split the node: separate it into the first keys, the middle key, and the last keys; then make new nodes with the first and last keys, promote the middle key to a child of that node's parent, and place the new nodes with keys as the children to the left and to the right of that middle key.

Of course, you can't do this if the node's parent is also full. To prevent this, as we go down searching for where to insert the new key, we preemptively split every node we find that has keys, so if a node's parent was full we already split it beforehand and can split that node as well. When the root gets full we put its middle key into a new node containing only it, which will be the new root; and the first and second half of its children in new nodes, will be that new root's two children. This is why we allowed the root to have less than keys.

What about deleting a key? First, suppose the key is in a leaf node, so there are no children under it. If that node has more than keys, we can just delete the key and be done; if it does have only keys we need to add a new one.

If that node's right sibling has over keys, we can move its leftmost key up into their parent, between them, move the key that was there in the parent down into our node, and turn the leftmost child of the right sibling into the new rightmost child of our node. If the left sibling has over keys, we can do the symmetrical thing.

If both siblings have keys, we need to merge our node and its sibling, also pulling down a key from their parent that was between the two; the opposite of the splitting process above. This only works if the parent has more than keys; so as we go down the tree we need to ensure the nodes we pass through have more than keys, like we did with insertion.

If the node with the key we want to delete has children, it's a bit more complicated because we need a new parent for those children. If the child to the right of the key we want to delete has over keys, we can find its minimum key (our deleted key's predecessor), delete that from its node (it will be a leaf node), and put that key in the place of our deleted key. If instead the child to the right has over keys, we do the same with its maximum key (our deleted key's successor).

If they both have keys, we need to merge the two of them, putting our to-delete key in the middle, and then delete our key from there (we have gone one step down the tree so we'll need to repeat this at most times).

Now that we know how to deal with B-trees, here's a fun fact: remember how I didn't fully explain insertion and deletion for red-black trees because there were too many cases to get into? There's a variation on red-black trees, called left-leaning red-black trees, which cut down about half of those cases without increasing time complexity by introducing a new constraint: if a (black) node has a red child and a black child, the red child has to be the left child. I won't need to describe how to implement those, because it happens that they are isomorphic to 2-3-4 trees!

A black node with 2 black children is equivalent to a 2-node; a black node with one (left) red child is equivalent to a 3-node; and a black node with two red children is equivalent to a 4-node. For more detail on how to implement either of these data structures, you can look at this presentation.

Problem-Solving Techniques

Dynamic Programming

A fundamental way of solving problems is divide-and-conquer: break up your problem into smaller subproblems, solve those, then combine the subsolutions into a solution for the entire problem. This can only be done if solutions for the subproblems do in fact make up a solution for the superproblem; if this is true we say the problem has optimal substructure.

In some problems, this leads to solving the same subproblems many times, when subproblems share subsubproblems. Dynamic programming means keeping a list of solutions for the subproblems you've already solved, so that if you come back to a subproblem you can simply look up its solution instead of re-solving it.

Well, not quite. That's how dynamic programming is usually defined, as generally and generically as possible, but here's how we usually apply it:

We frame the problem as a series of decisions, and look at one of the decisions . We pick an arbitrary possible value for that decision, , and assume the optimal solution makes that decision .
We say the problem has optimal substructure if, conditional on this assumption, we can break the problem into one or more subproblems, each with their own decisions, such that by solving each of the subproblems and combining their solutions with the decision , we get an optimal solution for the original problem.
Then we find those conditional solutions for every possible value of , and pick the best one; since a solution has to make some choice at point , one of them has to be optimal, so the best of those conditional solutions is the optimal overall solution.
And if there is a small (polynomial) set of potential subproblems that can arise from different choices of , by keeping track their solutions as they're solved we can avoid a combinatorial explosion that would result in exponential time, and solve the problem efficiently.

One example is the rod-cutting problem. Suppose you want to sell rods with integer lengths from 1 to centimeters. You are given a table saying how much a rod of each length will sell for. You start with one rod of length , and can cut the rod into as many smaller pieces as you want and sell them. How do you maximize revenue?

Here's the problem's optimal substructure: we consider the first/leftmost cut of the rod as a decision. If we assume this cut is centimeters from the left edge of the rod, the optimal solution consists of that leftmost -centimeter piece combined with the optimal way to cut the rest of the rod, with centimeters. ( might be equal to , in which case there are 0 centimeters left and we are done.) So, if the sale price of a rod of length is and the optimal revenue is , we have this optimal substructure:

We could implement a solution for the problem just like that, putting as the base case and a recursion for larger cases, but this would take exponential time because of how each subproblem solves subsubproblems. It goes much faster if we keep a table of for the values of we've already solved.

This can be done top-down, keeping the recursive implementation but adding a cache, and having each call look up its input in the cache, if it's found using it, and if it's not calculating the solution and storing it in the cache.

It can also be done bottom up, making an array to store the solution of every subproblem (starting with the solution of the base case ), and then finding and putting it in the table, then , then and so on. To find the solutions we use the solutions of smaller subproblems we found and stored in the array, so we don't need to recurse.

Solving the problem bottom-up is probably faster in this case, since it doesn't need to make recursive calls; however, in a problem where you'll only need to consider a small fraction of all possible subproblems, a top-down solution would be faster. Both of them have the same time complexity of , much better than the exponential time of the basic recursive solution.

We need to store the optimal value in the cache (to add to when comparing with the other choices of ), but we also need to store something to let us reconstruct the solution after we're done; it's not useful to say we can cut a rod in a way that makes $100 if we don't know what that way is.

In this case, we can store the optimal value of (the first cut) we found for each ; later we can reconstruct from the cache that the solution will be the optimal cut we found for , then the optimal cut we cached for a rod of length , then the cached optimal cut for a rod of length , and so on.

Greedy Algorithms

Greedy algorithms, just like dynamic programming, are used to solve problems where you need to make a series of choices to maximize some goal. However, unlike dynamic programming, which searches through every sequence of decisions efficiently by reusing results, greedy algorithms don't bother, and just pick whatever choice "looks best" at each point using a local heuristic. It turns out, in a lot of problems this works!

For example, suppose we are given a list of possible values for coins and are asked to make change, find the smallest set of coins with those values that add up to a total .

This is a classic dynamic programming problem: if we assume the optimal solution starts with coin , the rest of the coins in that solution will be an optimal solution for making change for a total . So we can solve it in time by making a table with the optimal amount of coins and first coin used for each change amount from 1 to .

There is also an obvious greedy algorithm: at every point pick the largest coin that won't make your total exceed . This algorithm doesn't work in general. For example, if your coin sizes are 1, 5 and 7, and you want to make change for 10 cents, the greedy algorithm picks a 7 and then has to end with three 1s, totaling 4 coins, more than the optimal solution of two 5s. However, for some sets of coin sizes, including for example {1, 5, 10, 25, 50, 100}, this algorithm does return optimal solutions!

^{^}
My college's library only had the third edition, but I also got and looked at PDF of the fourth edition.
^{^}
Insertion sort and merge sort are actually in chapter 2, but there was no reason for me to give them a separate section.
^{^}
We don't need to use more complicated amortized analysis methods here because since there's only one operation (insertion), there's only one possible sequence of n operations.
^{^}
There is also a min heap data structure, in which parents are smaller than their children and you can extract the least element. You can easily implement this by taking a max heap and multiplying all the keys by -1.
^{^}
We count the NULL at the end but not the origin node itself.