If you know a lot about passwords, you probably know that it’s good to choose a password from a distribution with a lot of “entropy”. The (Shannon) entropy of a distribution is defined as:
This definition is an answer to the question “how much information is encoded in a chosen password, on average?” This is also the approximate number of bits we need to specify which password we picked. An adversary would need to try at least passwords in order to guess yours, on average.
This is typically how people think about choosing strong passwords, and it works well in the case where we’re choosing among equally likely passwords. But it breaks horribly when we allow our distribution to be non-uniform. Because choosing good passwords is about memorableness as well as sheer strength, we might sometimes want to use complicated password selection schemes that don’t necessarily give uniformly distributed selections, and Shannon entropy can really screw up here.
A comically bad password selection scheme
Consider the following password selection algorithm:
- Roll a 20-sided die.
- If the die comes up 20, choose a random password from possibilities.
- Otherwise, choose the password
Of course, this is a terrible password selection strategy. 19 times out of 20, your password will be the very first one the attacker tries, if they know your strategy. But the entropy here doesn’t reflect that - for example, if we pick a random 1000-character lowercase password when we get a 20, this has an overall entropy of about 235 bits. Entropy tells us that we have an extremely strong password, since on average it will take a long time to crack the password. But of course in the typical case, it will take no time at all.
Clearly something is wrong here. How can we fix it?
The first thing to notice is that, as we’ve already implied, we’re engaged in a game of strategy with any would-be password crackers. We pick some strategy for choosing passwords, and then our opponent picks a strategy for guessing them. Our opponent’s best response is going to be to try the most likely passwords first.
So how should we really measure our password selection strategies? One way would be to use the information content of the most likely password, instead of the average information content:
This is called the min-entropy, and it’s nice for several reasons, not least because when is uniform, .
An issue with this method, though, is that it doesn’t let us distinguish the d20 scheme above with 1000-character passwords on a high roll, from a similar scheme with only 3-character passwords on a high roll. Even though the scheme is bad in both cases, it’s still better in the former case than the latter, if only by a little.
But I think the most important thing about the min-entropy is that it gives us a conservative estimate that we can use to compare password selection schemes, while Shannon entropy gives something more like a central estimate. If the min-entropy is bits, we can feel really confident that it will take guesses to crack our typical passwords, even if the distribution isn’t uniform. If one scheme’s min-entropy is better than another’s Shannon entropy, we can feel secure that the first scheme is better.
Thanks to Adam Scherlis for pointing out the issue with Shannon entropy, Katja Grace for discussions about the problem, and Thomas Sepulchre for pointing out a major flaw in the original post.
See Massey's (very short) "Guessing and Entropy" for a discussion of this lower bound.
Good points! My principled take is that you want to minimize your adversary's success probability, as a function of the number of guesses they take.
In the usual case where guessing some wrong password X does nothing to narrow their search other than tell them "the password isn't X", the best the adversary can do is spend their guesses on passwords in order of decreasing likelihood. If they know your password-generating-procedure, their best chance of succeeding in n tries is the sum of the likelihoods of the most common n passwords.
I note also that I could "fix" your roll-a-D20 password-generating procedure by rejecting samples until I get something the adversary assigns low probability to.
This won't work in general though...
It actually would, as long as you reject a candidate password with probability proportional to it's relative frequency. "password" in the above example would be almost certainly rejected as it's wildly more common that one of those 1000-character passwords.
Well, the counterexample I have in mind is: flip a coin until it comes up tails, your password is the number of heads you got in a row.
While you could rejection-sample this to e.g. a uniform distribution on the numbers [0,k), this would take 2k samples, and isn't very feasible. (As you need 22n work to get n bits of strength!)
I do agree that your trick works in all reasonable cases, whenever it's not hard to reach k different possible passwords.
Yep! I originally had a whole section about this, but cut it because it doesn't actually give you an ordering over schemes unless you also have a distribution over adversary strength, which seems like a big question. If one scheme's min-entropy is higher than another's max-entropy, you know that it's better for any beliefs about adversary strength.
I note you do at least get a partial ordering here, where some schemes always give the adversary lower cumulative probability of success as n increases than others.
This should be similar (perhaps more fine grained, idk) than the min-entropy approach. But I haven't thought about it :)
This statement is wrong. If we take your example of choosing the password "password" with probability 19/20 and a uniform random 1000-character lowercase with probability 1/20 (which represents about 24700 different passwords), then the average number of tries needed to guess your password is about 12024700−1, indeed it will take the attacker 24700−1 tries on average in the unlucky situation where you have not chosen the obvious password. This is extremely far from 2235−1.
Um, huh? There are 2^1000 1000-character passwords, not 2^4700. Where is the 4700 coming from?
(added after realizing the above was super wrong): Whoops, that's what I get for looking at comments first thing in the morning. log2(26^1000) = 4700 Still, the following bit stands:
I'd also like to register that, in my opinion, if it turns out that your comment is wrong and not my original statement, it's really bad manners to have said it so confidently.
(I'm now not sure if you made an error or if I did, though)
Update: I think you're actually totally right. The entropy gives a lower bound for the average, not the average itself. I'll update the post shortly.
I apologize, my wording was indeed rude.
As for why I was confident, well, this was a clear (in my eye at least) example of Jensen's inequality: we are comparing the mean of the log to the log of the mean. And, if you see this inequality come up often, you know that the inequality is always strict, except for a constant distribution. That's how I knew.
As a last note, I must praise the fact that you left your original comments (while editing them of course) instead of removing them. I respect you a lot for that.
I thought password entropy was already defined in terms of optimal encodings, which would already handle this? It's unusual to generate a password non-uniformly, but encodings like Huffman codes show us how it would work.
For example, if your password scheme is 99.9% "password" and 0.1% a random length ASCII string, the optimal encoding would be something like 1 = "password", 0 + ASCII string = that ASCII string. Given that encoding, "password" only has one bit of entropy.
To clarify a point in my sibling comment, the concept of "password strength" doesn't cleanly apply to an individual password. It's too contingent on factors that aren't within the password itself. Say I had some way of scoring passwords on their strength, and that this scoring method tells me that "correct horse battery staple" is a great password. But then some guy puts that password in a webcomic read by millions of people - now my password is going to be a lot worse, even though the content of the password didn't change.
Password selection schemes aren't susceptible to this kind of problem, and you can consistently compare the strength of one with the strength of another, using methods like the ones I'm talking about in the OP.
Ah I see what you're saying. Given a non-uniform password generation algorithm (like 99.9% "password", 0.1% random 1000-bit ASCII string), taking the average entropy of the schema is misleading, since the average password has 50 bits of entropy but 99.9% of the time the password will have 1 bit of entropy, and in this case taking the worst-entropy result (1 bit) is more useful than the average.
I think you're right, but this doesn't seem very useful, since any password generation scheme where this is relevant is a bad idea (since if you switched to a uniform distribution, you could either have stronger passwords with the same length, or just as strong passwords with a shorter length).
I disagree; as the post mentions, sometimes considerations such as memorability come into play. One example might be choosing random English sentences as passwords. You might do that by choosing a random parse tree of a certain size. But some English sentences have ambiguous parses, i.e. they'll have multiple ways to generate them. You *could* try to sample to avoid this problem, but it becomes pretty tricky to do that carefully. If you instead find the "most ambiguous sentence" in your set, you can get a lower bound on the safety of your scheme.
One feature of good password schemes is that you have some way to recover lost passwords.
Let's say I chose as my password: "Tithptacsp,aiwwitcwwcaelp". That password has plenty of entropy if you just look at it.
Then I might write down in some not "Entropy isn't sufficient to measure - 3-1". This allows me to go back to this post to look up the third paragraph and take the first sentence of it. Then I find the sentence "This is typically how people think about choosing strong passwords, and it works well in the case where we’re choosing among equally likely passwords" and can reconstruct "Tithptacsp,aiwwitcwwcaelp". Sentences are also generally good mnemonics.
If someone would however know that I'm using LessWrong as my source for passwords this way, that would allow them to just go through all sentences on LessWrong posts which radically reduces the entropy.
I don't think that's how people normally do it; partly because I think it makes more sense to try to find good password *schemes*, rather than good individual passwords, and measuring a password's optimal encoding requires knowing the distribution of passwords already. The optimal encoding story doesn't help you choose a good password scheme; you need to add on top of it some way of aggregating the code word lengths. In the example from the OP, you could use the average code word length of the scheme, which has you evaluating Shannon entropy again, or you could use the minimum code word length, which brings you back to min-entropy.
This seems like the exact sort of thing expected utility theory is made for.
Entropy is given by E[log 1/P]. Maximizing entropy can be seen as using a password-selection method with the assumption that log 1/P is the utility of a given password. Is that a reasonable utility function? What does it represent?
Well, 1/P seems to represent the rough number of guesses needed to guess a password. Which does seem like a relevant number for evaluating the utility of a password, as the difficulty in guessing it scales directly with the number of tries necessary. But should we take the logarithm of that? Should we just use 1/P directly? If we just use this directly, then we are risk-neutral with respect to guessing difficulty. Is that reasonable?
I don't think so, and here's a model: You might imagine that you have a distribution of adversaries. Some might just be script kiddies who are going to try 1-3 logical guesses and move on. Most might be using bots or botnets and guessing more, say 1000000 guesses or so. And small upper tail would consist of e.g. state actors who could spend an enormous amount of resources.
Assuming, unrealistically, that each of the adversaries would do equal amounts of harm when guessing your password, it seems like the value of the password difficulty would be in reducing the probability that they guess it, by locking out subset of the adversaries. For simplicity, we could take the cumulative distribution function of the adversary guess budget distribution as our measure of the probability that they will be defeated by the password.
So this model implies that a proper way of selecting is E[cdf(1/P)], where cdf is the cumulative distrobution function of attacker capabilities. The simple E[1/P] version can be seen as using an (improper) uniform prior where there is equal probability for any number of positive guesses, whether that is 1, 64827266195729462, a googolplex, Graham's number, or whatever. Meanwhile, E[log 1/P] can be seen as using a (still improper) prior where the order of magnitude of the adversary's capabilities are uniformly distributed. Both of these are unbounded, as tends to happen with improper priors, but if we used a proper prior, the utilities would be bounded, corresponding to strong risk-aversion.
A little more detail seems worth providing. If you have a pdf (over probabilities) f(p), which may or may not actually integrate to 1 or to any finite value, then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.
Conversely, if we start with a given cdf F(q), its pdf is F'(q), and the corresponding pdf for p is F'(1/p)/p^2.
So a uniform distribution over q (obviously improper) corresponds to F(q) = q and a pdf of 1/p^2. A uniform distribution over log q (again, obviously improper) corresponds to F(q) = log q and a pdf of 1/p.
If we continue this "one step further", a (no longer improper) uniform prior on p means F'(1/p)/p^2 = const or F'(q)=const/q^2 or F(q) = const/q; obviously the constant should be negative. That is, this would mean trying to minimize the expected probability of the password we pick. Equivalently, trying to minimize sum(p^2) over all our possible passwords. So, in particular, for a fixed set of passwords this says to pick them all with equal probability.
How does this figure of demerit do for the example in the OP? Let's generalize it a little, so that with probability p we pick one password and with probability 1/p we pick one of N passwords.
The implied model of our attackers here is that the probability that an attacker can afford to guess at least N passwords goes down in proportion to 1/N. Not obviously very realistic, but not obviously worse than the improper priors considered above either :-).
tailcalled alluded to the following but didn't make it explicit: If your prior is improper then you can make that CDF get unboundedly large, which means you can compensate for badness elsewhere that way, which means the pathology OP is about can happen; if your prior is proper then you can't. So propriety is in some sense exactly the property we want here.
[EDITED to add two things:] 1. After some discussion with tailcalled, there's a rewritten and hopefully clearer version of most of the above in a comment of mine further downthread. 2. I see that this one has received someone's strong-downvote. For the avoidance of doubt, I'm not complaining; you should strong-downvote things that you think are very bad; but it would be much more useful information if I understood what the problem is, and right now I don't. Did I make a mathematical error? Was my writing too unclear to be useful (e.g., because "probabilities" was a bad way to refer to 1/attacker_capacity, or because looking at things as a function of p = 1/attacker_capacity is needlessly confusing)? Did it feel as if I was trying to restate what tailcalled already said and take credit for it? Etc. Thanks!
I don't think I understand. What do you mean by a pdf over probabilities? Like I get how one can have a pdf over probabilities in e.g. a Laplace's law of succession-like situation, but I'm not sure what probabilities you have a pdf over here.
I said "probabilities" but should really have said something like "reciprocals of numbers of trials". The thing you called P, when you wrote things like "E[1/P]" and "E[log 1/P]". This has the same "units" as probability; if you wanted to express it in explicitly probabilistic terms it would be something like "least probable password the attacker has the resources to try".
It's fair enough to call that probabilities, I guess that just means that the thing I was really confused about is how you derive the CDF from the probabilities. That is, this part:
Under my framework, the distribution of amount of guesses that an attacker is willing to go through is a separate independent parameter that one can choose. I don't see where the f(1/q)/q^2 expression comes from.
It's a separate independent parameter for me too. My intention (though of course I may have screwed up) is that I'm describing the same thing as you are :-).
So, suppose you have a particular probability distribution for the number of guesses an attacker has. I am describing that by saying what its pdf is as a function of 1/#guesses (the thing I'm calling p). I call this f(p). You're describing it by saying what its cdf is as a function of #guesses (the thing you're calling 1/P). You call this cdf(1/P). These two formulations are equivalent.
 Well, maybe there are some weird technical issues if the cdf is sufficiently horribly non-differentiable -- a Cantor staircase or something of the sort.
Why start with f(p) rather than cdf(1/P)? No particular reason, I guess :-). It would probably have been cleaner just to define q=1/p (as I did) and then work with either the pdf or the cdf as a function of q.
So, let me try again -- I am not saying anything different, or at least not intending to, just cleaning up my notation and language and so forth -- and see whether you find the result (1) comprehensible and (2) compatible with what you said before. (Again, to be clear, I am claiming only to explicate what you already said, which I found excellent but a bit terse.)
A little more detail seems worth providing. Let's say that an attacker has "resources R" if they are willing and able to crack passwords with probabilities at least 1/R. (Is this a realistic model? I think it is provided R is fairly large, which in practice it usually will be, provided attackers know the probability distribution we are using. If I am an attacker and know e.g. that passwords of type A are used 10x more often than passwords of type B, then I will use 10x more of my resources on type-A passwords. This falls down if I have enough resources that I literally can't do that without repeating passwords, but for actually-realistic password distributions and attackers who are likely attacking multiple victims at once this is not a problem.)
So, we can describe the distribution of attackers in terms of the pdf f(R). In some sense, as described above, values of R are reciprocals of password-probabilities; the corresponding pdf over probabilities is f(1/p)/p^2 where p=1/R. If we start with the cdf F(R) instead, the pdf is f'(R), the pdf over probabilities is F'(1/p)/p^2, and of course the cdf for probabilities is just 1-F(1/p).
In each case, tailcalled's prescription is to take E[cdf(1/p)], the expected fraction of attackers who fail to crack our password, as our figure of merit. The expectation is taken over our passwords with the probability distribution we are using (the probability distribution over passwords, not over attackers) so it equals the sum of p.F(1/p) where F is the cdf for attacker-resources.
These things I've called "pdfs" and "cdfs" may not actually integrate/tend to 1; we may leave them unnormalized, or indeed they may be improper (so that the integral of the pdf / limiting value of the cdf is infinite). What we are doing with these is just maximizing/minimizing some expectation involving them; rescaling the values of the pdf/cdf doesn't change the result (so we needn't care about normalization) and in many cases the expectation will converge even if the pdf/cdf is improper.
Let's look more explicitly at the examples tailcalled mentioned. A uniform distribution for R, which of course is very improper indeed, corresponds to F(R)=R or f(R)=1. The corresponding pdf for p=1/R is 1/p^2. A uniform distribution for log R, still improper, corresponds to F(R) = log R or f(R) = 1/R. The corresponding pdf for p=1/R is 1/p. This is exactly the case in which expected utility = entropy.
(As tailcalled implied but didn't quite say explicitly, the fact that these are improper is what enables the pathology described by OP: as R gets large the cdf -> infinity, so a modest probability of picking extremely-low-probability passwords can give an arbitrarily large positive contribution to the expectation we're trying to maximize, even if most of the time we do very badly.)
We looked at pdf(p) = 1/p^2 and pdf(p) = 1/p; let's go one step further and take pdf(p)=1, corresponding to a uniform distribution over 1/R; cdf(R) = 1-1/R, and pdf(R) = 1/R^2. Our figure of merit is E[cdf(1/p)] = E[1-p] = 1-E[p], so the expected negative utility we get is (aside from irrelevant constant offsets and scalings) E[p] = sum(p^2).
(At this point, continuing from "How does this figure of demerit do ..." in the original version of what I wrote seems fine, except that the final paragraph of what I already wrote is effectively incorporated above.)
The place where I'm getting confused is:
In my model, the E is taking an expectation over different passwords. The P refers to the probability assigned to the selected password, not to anything related to the attacker. So 1/P refers to the number of guesses a password requires to crack.
My description of the distribution of attacker capabilities is just named cdf, not cdf(1/P); cdf(1/P) is the amount of attackers who have the capabilities to crack a given password.
... actually I just reread the thread and I think I misunderstood your previous comment:
Since in my original comment, P is not the least probable password the attacker has the resources to try.
Maybe to clean up the math a bit, let's restart with my formulation:
Let W be the distribution of password you are selecting from, and w∼W be a password. Let PW be the prior pmf for W, let A be a distribution of adversary capabilities, and let cdfA be the cdf for A. The utility of a password-selection method is then given by approximately:
The function is cdf. The way it's used in the expected-utility calculation is that it's applied to 1/p where p is the probability of a given password. My original use of the term "probability" for the reciprocal of the thing fed to the cdf function was needlessly confusing, which is why I dropped it in the rewrite.
In your original comment, P is the probability of a particular password. (I say this just to confirm that I do, and did, understand that.)
But if we are going to explain what the cdf function actually is, we need to say something of the form "cdf(R) is the fraction -- or, in the case of improper not-exactly-probability-distributions, something more like the total number -- of attackers for whom ...". And I think the correct way to fill in that "..." is something like "when they crack passwords, we expect that the least probable password they're likely to crack has probability 1/R". (Right?)
In other words, I'm trying to be more explicit about what "adversary capabilities" actually cashes out to, and I think that's what it is.
Your more-explicit formalization of the calculation agrees with my understanding; to whatever extent you feel that what I'm describing is different from what you're describing, I am pretty confident the cause is not that we have different understandings of the mathematics at that point. I think it's you misunderstanding me / me communicating badly, not me misunderstanding you / you communicating badly.
(It is a lamentable misfeature of our language that -- so far as I can tell -- there is no good way to say "what's going on here is that what A is trying to say is not what B is interpreting it as" that doesn't tend to assign blame to one or other party. You have to call it misunderstanding (implicitly blaming B) or miscommunicating (implicitly blaming A). But it takes two to tango and communication failures often involve suboptimality at both ends, and even in cases where it doesn't assigning/taking blame is often an irrelevant distraction.)
Yes, something like that. I'd probably fill it in with "who will probably succeed at cracking passwords w where P(w) is less than or equal to 1/R", but it's a similar point.
So would it make sense to score each password selection scheme on min-entropy, mid-entropy (Shannon) and max-entropy, sort of like algorithms are scored on best case big O, regular big O, and worst case big O?
The number of times doesn't seem like it'd be relevant in practice. The amount of time, does.
Things that would work:
1) Even the most basic dictionary attack (which will include the word "password").
2) A list of simple/common passwords.
Perhaps controversially, I think this is a bad selection scheme even if you replace "password" with any other string.
The cool kids talk about Renyi min-entropy, which, as you say, deals with the skewed distribution issue. I've written that into standards myself (actually I wrote a long definition of the word "unpredictability" using that concept, and then used that all over the document).
But in general, unless you're writing some very formal document, you can just say "entropy" as shorthand and people will know what you're talking about. Nobody is actually going to code your suggested password generation algorithm, and no actual practitioner thinks that "12345678" is a good 8-character password.
A bigger problem is that you don't necessarily know your adversary's state of information, so you don't know what the distribution looks like from their point of view. Not a problem for random passwords, but it can be hard to evaluate a user-generated password.
Attackers aren't given infinite attempts, and even if they are, God doesn't give them infinite time. So what you really want is to minimize the probability that the attacker guesses your password before giving up.
Suppose the attacking bot can make 200,000 attempts. By the first scheme, the probability the attacker guesses the password is .95 (plus an infinitesimal). By the first scheme but with a three-character password on a high roll, the probability is 1.00 (with 50 different characters, there are only 125K three-character words, so success in the remaining 199,999 attempts is certain).
By this measure, both passwords are weak, but the second is weaker than the first.
My Blackberry locks attackers out after 10 tries. So I would choose n=10 rather than n=200000. By that measure, the first scheme is roughly p=.950000, and the second is roughly p=.950072.
Taking it a bit further, these conditional probabilities depend upon some prior for various types of attacks. The numbers you have are P(guess password | password scheme AND attacker guesses via login attempts). They are both still really bad, but the difference is slightly greater if the password can be attacked by somehow exfiltrating a password hash database and running a billion guesses against that.