This is a "live" Fermi estimate, where I expect to make mistakes and haven't done much editing.

If you're working on a simple mathematical program, you can attempt to estimate the number of floating point or integer operations, compare that to statistics on CPU latencies/throughput, and use that to get a rough estimate.

Let's try estimating how long it would take to run a simple primality test, a modified Sieve of Eratosthenes:

function modified_sieve(n)
known_primes = [2]
for i in 3:2:n
mods = [i % p for p in known_primes]
if all(mods .> 0)
push!(i, known_primes)
end
end
return known_primes
end

In this algorithm, we store a list of known prime numbers. We iterate through odd numbers, and check whether any of them are divisible by a known prime, by calculating the modulus against every known prime in our list. If it's not divisible by any of them, we add it to our list of primes.

There's a lot we could do to come up with a better algorithm, but what I'm interested in is estimating how fast an ideal execution of this algorithm could be. How long could we expect this to run, on a modern CPU?

I expect, looking at this, that the vast majority of the time is going to be spent on mods = [i % p for p in known_primes]. Calculating a modulus is something like 30-50 cycles according to the few google searches I found, and everything else should be significantly less than that.

The number of mods we need to calculate for any i is the number of known primes, which is roughly ilog(i). Let's say we're getting all the primes until 100,000. Then we should need to calculate about Σ99,999i=3ilog(i), which is about 200 million mods. Times 30-50 cycles is about .6-1 billion cycles. On a 3gHz processor, that should take about a 200-333 milliseconds.

What do we actually see?

using BenchmarkTools
@btime modified_sieve(100_000)
> 352.089 ms (240255 allocations: 2.12 GiB)

That...shouldn't actually happen. Actual programs usually take 3x-10x as long as my estimates, because there's a lot of overhead I'm not accounting for. This suggests that there's some optimization happening that I didn't expect. And looking at the generated code, I do see several references to simdloop. Ok then. That just happened to cancel out the overhead.

Let's try an algorithmic improvement. We don't actually need to check against all primes – we only have to go up to √i each time. That brings us down to an estimated 2 million mods, which in theory should be 100 times faster.

function modified_sieve2(n)
known_primes = [2]
for i in 3:2:n
mods = [i % p for p in known_primes if p^2 < i]
if all(mods .> 0)
push!(known_primes, i)
end
end
return known_primes
end
@btime modified_sieve2(100_000)
> 104.270 ms (336536 allocations: 83.04 MiB)

Hmm. It's faster, but nowhere near as much as expected – 3.5x vs. 100x. Adding the filter each loop probably messes us up in a couple ways: it adds a fair bit of work, and it makes it harder for the CPU to predict which calculations need to happen. Let's see if we can get a speed up by making the primes to be checked more predictable:

function modified_sieve3(n)
known_primes = [2]
small_primes = Int[]
next_small_prime_index = 1
for i in 3:2:n
if known_primes[next_small_prime_index]^2 <= i
push!(small_primes, known_primes[next_small_prime_index])
next_small_prime_index += 1
end
mods = [i % p for p in small_primes]
if all(mods .> 0)
push!(known_primes, i)
end
end
return known_primes
end

Now we added a bunch of extra code to make the mod calculations more predictable: we track the primes that need to be tested separately, and only update them when needed. And it pays off:

@btime modified_sieve3(100_000)
> 9.291 ms (150010 allocations: 25.87 MiB)

This is a major benefit of making Fermi estimates. We predicted a possible 100x speedup, and got only 3.5x. That suggested that there was more on the table, and were able to get a further 10x. If I hadn't made the estimate, I might have thought the filter version was the best possible, and stopped there.

And I find this to translate pretty well to real projects – the functions where I estimate the greatest potential improvements are the ones that I actually end up being able to speed up a lot, and this helps me figure out whether it's worth spending time optimizing a piece of code.

This is a "live" Fermi estimate, where I expect to make mistakes and haven't done much editing.

If you're working on a simple mathematical program, you can attempt to estimate the number of floating point or integer operations, compare that to statistics on CPU latencies/throughput, and use that to get a rough estimate.

Let's try estimating how long it would take to run a simple primality test, a modified Sieve of Eratosthenes:

In this algorithm, we store a list of known prime numbers. We iterate through odd numbers, and check whether any of them are divisible by a known prime, by calculating the modulus against every known prime in our list. If it's not divisible by any of them, we add it to our list of primes.

There's a lot we could do to come up with a better algorithm, but what I'm interested in is estimating how fast an ideal execution of this algorithm could be. How long could we expect this to run, on a modern CPU?

I expect, looking at this, that the vast majority of the time is going to be spent on

`mods = [i % p for p in known_primes]`

. Calculating a modulus is something like 30-50 cycles according to the few google searches I found, and everything else should be significantly less than that.The number of mods we need to calculate for any

`i`

is the number of known primes, which is roughly ilog(i). Let's say we're getting all the primes until 100,000. Then we should need to calculate about Σ99,999i=3ilog(i), which is about 200 million mods. Times 30-50 cycles is about .6-1 billion cycles. On a 3gHz processor, that should take about a 200-333 milliseconds.What do we actually see?

That...shouldn't actually happen. Actual programs usually take 3x-10x as long as my estimates, because there's a lot of overhead I'm not accounting for. This suggests that there's some optimization happening that I didn't expect. And looking at the generated code, I do see several references to

`simdloop`

. Ok then. That just happened to cancel out the overhead.Let's try an algorithmic improvement. We don't actually need to check against all primes – we only have to go up to √i each time. That brings us down to an estimated 2 million mods, which in theory should be 100 times faster.

Hmm. It's faster, but nowhere near as much as expected – 3.5x vs. 100x. Adding the filter each loop probably messes us up in a couple ways: it adds a fair bit of work, and it makes it harder for the CPU to predict which calculations need to happen. Let's see if we can get a speed up by making the primes to be checked more predictable:

Now we added a bunch of extra code to make the

`mod`

calculations more predictable: we track the primes that need to be tested separately, and only update them when needed. And it pays off:This is a major benefit of making Fermi estimates. We predicted a possible 100x speedup, and got only 3.5x. That suggested that there was more on the table, and were able to get a further 10x. If I hadn't made the estimate, I might have thought the filter version was the best possible, and stopped there.

And I find this to translate pretty well to real projects – the functions where I estimate the greatest potential improvements are the ones that I actually end up being able to speed up a lot, and this helps me figure out whether it's worth spending time optimizing a piece of code.