In the usual case where guessing some wrong password X does nothing to narrow their search other than tell them "the password isn't X", the best the adversary can do is spend their guesses on passwords in order of decreasing likelihood. If they know your password-generating-procedure, their best chance of succeeding in tries is the sum of the likelihoods of the most common $n$ passwords.

[-]James Payor4y30

I note also that I could "fix" your roll-a-D20 password-generating procedure by rejecting samples until I get something the adversary assigns low probability to.

This won't work in general though...

[-]Maximum_Skull4y20

It actually would, as long as you reject a candidate password with probability proportional to it's relative frequency. "password" in the above example would be almost certainly rejected as it's wildly more common that one of those 1000-character passwords.

[-]James Payor4y30

Well, the counterexample I have in mind is: flip a coin until it comes up tails, your password is the number of heads you got in a row.

While you could rejection-sample this to e.g. a uniform distribution on the numbers , this would take $2^{k}$ samples, and isn't very feasible. (As you need $2^{2^{n}}$ work to get $n$ bits of strength!)

I do agree that your trick works in all reasonable cases, whenever it's not hard to reach $k$ different possible passwords.

[-]benwr4y20

Yep! I originally had a whole section about this, but cut it because it doesn't actually give you an ordering over schemes unless you also have a distribution over adversary strength, which seems like a big question. If one scheme's min-entropy is higher than another's max-entropy, you know that it's better for any beliefs about adversary strength.

[-]James Payor4y20

Nice!

I note you do at least get a partial ordering here, where some schemes always give the adversary lower cumulative probability of success as increases than others.

This should be similar (perhaps more fine grained, idk) than the min-entropy approach. But I haven't thought about it :)

[-]Thomas Sepulchre4y50

An adversary would need to try about passwords in order to guess yours, on average.

This statement is wrong. If we take your example of choosing the password "password" with probability $19 / 20$ and a uniform random 1000-character lowercase with probability $1 / 20$ (which represents about $2^{4700}$ different passwords), then the average number of tries needed to guess your password is about $\frac{1}{20} 2^{4700 - 1}$ , indeed it will take the attacker $2^{4700 - 1}$ tries on average in the unlucky situation where you have not chosen the obvious password. This is extremely far from $2^{235 - 1}$ .

[-]benwr4y50

~~Um, huh? There are 2^1000 1000-character passwords, not 2^4700. Where is the 4700 coming from?~~

(added after realizing the above was super wrong): Whoops, that's what I get for looking at comments first thing in the morning. log2(26^1000) = 4700 Still, the following bit stands:

I'd also like to register that, in my opinion, if it turns out that your comment is wrong and not my original statement, it's really bad manners to have said it so confidently.

(I'm now not sure if you made an error or if I did, though)

Update: I think you're actually totally right. The entropy gives a lower bound for the average, not the average itself. I'll update the post shortly.

[-]Thomas Sepulchre4y50

I apologize, my wording was indeed rude.

As for why I was confident, well, this was a clear (in my eye at least) example of Jensen's inequality: we are comparing the mean of the log to the log of the mean. And, if you see this inequality come up often, you know that the inequality is always strict, except for a constant distribution. That's how I knew.

As a last note, I must praise the fact that you left your original comments (while editing them of course) instead of removing them. I respect you a lot for that.

[-]Brendan Long4y50

I thought password entropy was already defined in terms of optimal encodings, which would already handle this? It's unusual to generate a password non-uniformly, but encodings like Huffman codes show us how it would work.

For example, if your password scheme is 99.9% "password" and 0.1% a random length ASCII string, the optimal encoding would be something like 1 = "password", 0 + ASCII string = that ASCII string. Given that encoding, "password" only has one bit of entropy.

[-]benwr4y50

To clarify a point in my sibling comment, the concept of "password strength" doesn't cleanly apply to an individual password. It's too contingent on factors that aren't within the password itself. Say I had some way of scoring passwords on their strength, and that this scoring method tells me that "correct horse battery staple" is a great password. But then some guy puts that password in a webcomic read by millions of people - now my password is going to be a lot worse, even though the content of the password didn't change.

Password selection schemes aren't susceptible to this kind of problem, and you can consistently compare the strength of one with the strength of another, using methods like the ones I'm talking about in the OP.

[-]Brendan Long4y10

Ah I see what you're saying. Given a non-uniform password generation algorithm (like 99.9% "password", 0.1% random 1000-bit ASCII string), taking the average entropy of the schema is misleading, since the average password has 50 bits of entropy but 99.9% of the time the password will have 1 bit of entropy, and in this case taking the worst-entropy result (1 bit) is more useful than the average.

I think you're right, but this doesn't seem very useful, since any password generation scheme where this is relevant is a bad idea (since if you switched to a uniform distribution, you could either have stronger passwords with the same length, or just as strong passwords with a shorter length).

[-]benwr4y40

any password generation scheme where this is relevant is a bad idea

I disagree; as the post mentions, sometimes considerations such as memorability come into play. One example might be choosing random English sentences as passwords. You might do that by choosing a random parse tree of a certain size. But some English sentences have ambiguous parses, i.e. they'll have multiple ways to generate them. You *could* try to sample to avoid this problem, but it becomes pretty tricky to do that carefully. If you instead find the "most ambiguous sentence" in your set, you can get a lower bound on the safety of your scheme.

[-]ChristianKl4y30

One feature of good password schemes is that you have some way to recover lost passwords.

Let's say I chose as my password: "Tithptacsp,aiwwitcwwcaelp". That password has plenty of entropy if you just look at it.

Then I might write down in some not "Entropy isn't sufficient to measure - 3-1". This allows me to go back to this post to look up the third paragraph and take the first sentence of it. Then I find the sentence "This is typically how people think about choosing strong passwords, and it works well in the case where we’re choosing among equally likely passwords" and can reconstruct "Tithptacsp,aiwwitcwwcaelp". Sentences are also generally good mnemonics.

If someone would however know that I'm using LessWrong as my source for passwords this way, that would allow them to just go through all sentences on LessWrong posts which radically reduces the entropy.

[-]benwr4y10

I don't think that's how people normally do it; partly because I think it makes more sense to try to find good password *schemes*, rather than good individual passwords, and measuring a password's optimal encoding requires knowing the distribution of passwords already. The optimal encoding story doesn't help you choose a good password scheme; you need to add on top of it some way of aggregating the code word lengths. In the example from the OP, you could use the average code word length of the scheme, which has you evaluating Shannon entropy again, or you could use the minimum code word length, which brings you back to min-entropy.

[-]tailcalled4y40

This seems like the exact sort of thing expected utility theory is made for.

Entropy is given by E[log 1/P]. Maximizing entropy can be seen as using a password-selection method with the assumption that log 1/P is the utility of a given password. Is that a reasonable utility function? What does it represent?

Well, 1/P seems to represent the rough number of guesses needed to guess a password. Which does seem like a relevant number for evaluating the utility of a password, as the difficulty in guessing it scales directly with the number of tries necessary. But should we take the logarithm of that? Should we just use 1/P directly? If we just use this directly, then we are risk-neutral with respect to guessing difficulty. Is that reasonable?

I don't think so, and here's a model: You might imagine that you have a distribution of adversaries. Some might just be script kiddies who are going to try 1-3 logical guesses and move on. Most might be using bots or botnets and guessing more, say 1000000 guesses or so. And small upper tail would consist of e.g. state actors who could spend an enormous amount of resources.

Assuming, unrealistically, that each of the adversaries would do equal amounts of harm when guessing your password, it seems like the value of the password difficulty would be in reducing the probability that they guess it, by locking out subset of the adversaries. For simplicity, we could take the cumulative distribution function of the adversary guess budget distribution as our measure of the probability that they will be defeated by the password.

So this model implies that a proper way of selecting is E[cdf(1/P)], where cdf is the cumulative distrobution function of attacker capabilities. The simple E[1/P] version can be seen as using an (improper) uniform prior where there is equal probability for any number of positive guesses, whether that is 1, 64827266195729462, a googolplex, Graham's number, or whatever. Meanwhile, E[log 1/P] can be seen as using a (still improper) prior where the order of magnitude of the adversary's capabilities are uniformly distributed. Both of these are unbounded, as tends to happen with improper priors, but if we used a proper prior, the utilities would be bounded, corresponding to strong risk-aversion.

[-]gjm4y*20

A little more detail seems worth providing. If you have a pdf (over probabilities) f(p), which may or may not actually integrate to 1 or to any finite value, then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.

Conversely, if we start with a given cdf F(q), its pdf is F'(q), and the corresponding pdf for p is F'(1/p)/p^2.

So a uniform distribution over q (obviously improper) corresponds to F(q) = q and a pdf of 1/p^2. A uniform distribution over log q (again, obviously improper) corresponds to F(q) = log q and a pdf of 1/p.

If we continue this "one step further", a (no longer improper) uniform prior on p means F'(1/p)/p^2 = const or F'(q)=const/q^2 or F(q) = const/q; obviously the constant should be negative. That is, this would mean trying to minimize the expected probability of the password we pick. Equivalently, trying to minimize sum(p^2) over all our possible passwords. So, in particular, for a fixed set of passwords this says to pick them all with equal probability.

How does this figure of demerit do for the example in the OP? Let's generalize it a little, so that with probability p we pick one password and with probability 1/p we pick one of N passwords.

The entropy here is p log 1/p + (1-p) log N/(1-p). If we call this E(p,N) then note that E(1/2,N) >= E(1, 2N), so setting p=1/2 can be "fixed" by doubling N.
The sum-of-squared-probabilities is p^2 + (1-p)^2/N. When p=0 this is 1/N. When p>0, even in the limit of infinite N it's 1/p. So for any N, if you increase p beyond 1/sqrt(N) you cannot hope to compensate for this by increasing N further.

The implied model of our attackers here is that the probability that an attacker can afford to guess at least N passwords goes down in proportion to 1/N. Not obviously very realistic, but not obviously worse than the improper priors considered above either :-).

tailcalled alluded to the following but didn't make it explicit: If your prior is improper then you can make that CDF get unboundedly large, which means you can compensate for badness elsewhere that way, which means the pathology OP is about can happen; if your prior is proper then you can't. So propriety is in some sense exactly the property we want here.

[EDITED to add two things:] 1. After some discussion with tailcalled, there's a rewritten and hopefully clearer version of most of the above in a comment of mine further downthread. 2. I see that this one has received someone's strong-downvote. For the avoidance of doubt, I'm not complaining; you should strong-downvote things that you think are very bad; but it would be much more useful information if I understood what the problem is, and right now I don't. Did I make a mathematical error? Was my writing too unclear to be useful (e.g., because "probabilities" was a bad way to refer to 1/attacker_capacity, or because looking at things as a function of p = 1/attacker_capacity is needlessly confusing)? Did it feel as if I was trying to restate what tailcalled already said and take credit for it? Etc. Thanks!

[-]tailcalled4y10

A little more detail seems worth providing. If you have a pdf (over probabilities) f(p), which may or may not actually integrate to 1 or to any finite value, then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.
Conversely, if we start with a given cdf F(q), its pdf is F'(q), and the corresponding pdf for p is F'(1/p)/p^2.

I don't think I understand. What do you mean by a pdf over probabilities? Like I get how one can have a pdf over probabilities in e.g. a Laplace's law of succession-like situation, but I'm not sure what probabilities you have a pdf over here.

[-]gjm4y20

I said "probabilities" but should really have said something like "reciprocals of numbers of trials". The thing you called P, when you wrote things like "E[1/P]" and "E[log 1/P]". This has the same "units" as probability; if you wanted to express it in explicitly probabilistic terms it would be something like "least probable password the attacker has the resources to try".

[-]tailcalled4y10

It's fair enough to call that probabilities, I guess that just means that the thing I was really confused about is how you derive the CDF from the probabilities. That is, this part:

then the corresponding pdf for our measure of attacker-capability q=1/p is f(1/q)/q^2; the relevant CDF here is the integral from 0 to q of this.

Under my framework, the distribution of amount of guesses that an attacker is willing to go through is a separate independent parameter that one can choose. I don't see where the f(1/q)/q^2 expression comes from.

[-]gjm4y20

It's a separate independent parameter for me too. My intention (though of course I may have screwed up) is that I'm describing the same thing as you are :-).

So, suppose you have a particular probability distribution for the number of guesses an attacker has. I am describing that by saying what its pdf is as a function of 1/#guesses (the thing I'm calling p). I call this f(p). You're describing it by saying what its cdf is as a function of #guesses (the thing you're calling 1/P). You call this cdf(1/P). These two formulations are equivalent[1].

[1] Well, maybe there are some weird technical issues if the cdf is sufficiently horribly non-differentiable -- a Cantor staircase or something of the sort.

Why start with f(p) rather than cdf(1/P)? No particular reason, I guess :-). It would probably have been cleaner just to define q=1/p (as I did) and then work with either the pdf or the cdf as a function of q.

So, let me try again -- I am not saying anything different, or at least not intending to, just cleaning up my notation and language and so forth -- and see whether you find the result (1) comprehensible and (2) compatible with what you said before. (Again, to be clear, I am claiming only to explicate what you already said, which I found excellent but a bit terse.)

A little more detail seems worth providing. Let's say that an attacker has "resources R" if they are willing and able to crack passwords with probabilities at least 1/R. (Is this a realistic model? I think it is provided R is fairly large, which in practice it usually will be, provided attackers know the probability distribution we are using. If I am an attacker and know e.g. that passwords of type A are used 10x more often than passwords of type B, then I will use 10x more of my resources on type-A passwords. This falls down if I have enough resources that I literally can't do that without repeating passwords, but for actually-realistic password distributions and attackers who are likely attacking multiple victims at once this is not a problem.)

So, we can describe the distribution of attackers in terms of the pdf f(R). In some sense, as described above, values of R are reciprocals of password-probabilities; the corresponding pdf over probabilities is f(1/p)/p^2 where p=1/R. If we start with the cdf F(R) instead, the pdf is f'(R), the pdf over probabilities is F'(1/p)/p^2, and of course the cdf for probabilities is just 1-F(1/p).

In each case, tailcalled's prescription is to take E[cdf(1/p)], the expected fraction of attackers who fail to crack our password, as our figure of merit. The expectation is taken over our passwords with the probability distribution we are using (the probability distribution over passwords, not over attackers) so it equals the sum of p.F(1/p) where F is the cdf for attacker-resources.

These things I've called "pdfs" and "cdfs" may not actually integrate/tend to 1; we may leave them unnormalized, or indeed they may be improper (so that the integral of the pdf / limiting value of the cdf is infinite). What we are doing with these is just maximizing/minimizing some expectation involving them; rescaling the values of the pdf/cdf doesn't change the result (so we needn't care about normalization) and in many cases the expectation will converge even if the pdf/cdf is improper.

Let's look more explicitly at the examples tailcalled mentioned. A uniform distribution for R, which of course is very improper indeed, corresponds to F(R)=R or f(R)=1. The corresponding pdf for p=1/R is 1/p^2. A uniform distribution for log R, still improper, corresponds to F(R) = log R or f(R) = 1/R. The corresponding pdf for p=1/R is 1/p. This is exactly the case in which expected utility = entropy.

(As tailcalled implied but didn't quite say explicitly, the fact that these are improper is what enables the pathology described by OP: as R gets large the cdf -> infinity, so a modest probability of picking extremely-low-probability passwords can give an arbitrarily large positive contribution to the expectation we're trying to maximize, even if most of the time we do very badly.)

We looked at pdf(p) = 1/p^2 and pdf(p) = 1/p; let's go one step further and take pdf(p)=1, corresponding to a uniform distribution over 1/R; cdf(R) = 1-1/R, and pdf(R) = 1/R^2. Our figure of merit is E[cdf(1/p)] = E[1-p] = 1-E[p], so the expected negative utility we get is (aside from irrelevant constant offsets and scalings) E[p] = sum(p^2).

(At this point, continuing from "How does this figure of demerit do ..." in the original version of what I wrote seems fine, except that the final paragraph of what I already wrote is effectively incorporated above.)

[-]tailcalled4y10

So, suppose you have a particular probability distribution for the number of guesses an attacker has. I am describing that by saying what its pdf is as a function of 1/#guesses (the thing I'm calling p). I call this f(p). You're describing it by saying what its cdf is as a function of #guesses (the thing you're calling 1/P). You call this cdf(1/P). These two formulations are equivalent[1].

The place where I'm getting confused is:

In my model, the E is taking an expectation over different passwords. The P refers to the probability assigned to the selected password, not to anything related to the attacker. So 1/P refers to the number of guesses a password requires to crack.

My description of the distribution of attacker capabilities is just named cdf, not cdf(1/P); cdf(1/P) is the amount of attackers who have the capabilities to crack a given password.

... actually I just reread the thread and I think I misunderstood your previous comment:

I said "probabilities" but should really have said something like "reciprocals of numbers of trials". The thing you called P, when you wrote things like "E[1/P]" and "E[log 1/P]". This has the same "units" as probability; if you wanted to express it in explicitly probabilistic terms it would be something like "least probable password the attacker has the resources to try".

Since in my original comment, P is not the least probable password the attacker has the resources to try.

Maybe to clean up the math a bit, let's restart with my formulation:

Let be the distribution of password you are selecting from, and $w \sim W$ be a password. Let $P_{W}$ be the prior pmf for $W$ , let $A$ be a distribution of adversary capabilities, and let $c d f_{A}$ be the cdf for $A$ . The utility of a password-selection method is then given by approximately:

$U (W) = E_{w \sim W} [c d f_{A} (1 / P_{W} (w))]$

[-]gjm4y20

The function is cdf. The way it's used in the expected-utility calculation is that it's applied to 1/p where p is the probability of a given password. My original use of the term "probability" for the reciprocal of the thing fed to the cdf function was needlessly confusing, which is why I dropped it in the rewrite.

since in my original comment, P is not the least probable password the attacker has the resources to try.

In your original comment, P is the probability of a particular password. (I say this just to confirm that I do, and did, understand that.)

But if we are going to explain what the cdf function actually is, we need to say something of the form "cdf(R) is the fraction -- or, in the case of improper not-exactly-probability-distributions, something more like the total number -- of attackers for whom ...". And I think the correct way to fill in that "..." is something like "when they crack passwords, we expect that the least probable password they're likely to crack has probability 1/R". (Right?)

In other words, I'm trying to be more explicit about what "adversary capabilities" actually cashes out to, and I think that's what it is.

Your more-explicit formalization of the calculation agrees with my understanding; to whatever extent you feel that what I'm describing is different from what you're describing, I am pretty confident the cause is not that we have different understandings of the mathematics at that point. I think it's you misunderstanding me / me communicating badly, not me misunderstanding you / you communicating badly.

(It is a lamentable misfeature of our language that -- so far as I can tell -- there is no good way to say "what's going on here is that what A is trying to say is not what B is interpreting it as" that doesn't tend to assign blame to one or other party. You have to call it misunderstanding (implicitly blaming B) or miscommunicating (implicitly blaming A). But it takes two to tango and communication failures often involve suboptimality at both ends, and even in cases where it doesn't assigning/taking blame is often an irrelevant distraction.)

[-]tailcalled4y10

In your original comment, P is the probability of a particular password. (I say this just to confirm that I do, and did, understand that.)

Yes.

But if we are going to explain what the cdf function actually is, we need to say something of the form "cdf(R) is the fraction -- or, in the case of improper not-exactly-probability-distributions, something more like the total number -- of attackers for whom ...". And I think the correct way to fill in that "..." is something like "when they crack passwords, we expect that the least probable password they're likely to crack has probability 1/R". (Right?)

Yes, something like that. I'd probably fill it in with "who will probably succeed at cracking passwords w where P(w) is less than or equal to 1/R", but it's a similar point.

(It is a lamentable misfeature of our language that -- so far as I can tell -- there is no good way to say "what's going on here is that what A is trying to say is not what B is interpreting it as" that doesn't tend to assign blame to one or other party. You have to call it misunderstanding (implicitly blaming B) or miscommunicating (implicitly blaming A). But it takes two to tango and communication failures often involve suboptimality at both ends, and even in cases where it doesn't assigning/taking blame is often an irrelevant distraction.)

Yes.

[-]Yoav Ravid4y30

So would it make sense to score each password selection scheme on min-entropy, mid-entropy (Shannon) and max-entropy, sort of like algorithms are scored on best case big O, regular big O, and worst case big O?

[-]Pattern4y20

Of course, this is a terrible password selection strategy. 19 times out of 20, your password will be the very first one the attacker tries, if they know your strategy.

The number of times doesn't seem like it'd be relevant in practice. The amount of time, does.

Things that would work:

1) Even the most basic dictionary attack (which will include the word "password").

2) A list of simple/common passwords.

[-]benwr4y10

Perhaps controversially, I think this is a bad selection scheme even if you replace "password" with any other string.

[+][comment deleted]4y10

[-]jbash4y10

The cool kids talk about Renyi min-entropy, which, as you say, deals with the skewed distribution issue. I've written that into standards myself (actually I wrote a long definition of the word "unpredictability" using that concept, and then used that all over the document).

But in general, unless you're writing some very formal document, you can just say "entropy" as shorthand and people will know what you're talking about. Nobody is actually going to code your suggested password generation algorithm, and no actual practitioner thinks that "12345678" is a good 8-character password.

A bigger problem is that you don't necessarily know your adversary's state of information, so you don't know what the distribution looks like from their point of view. Not a problem for random passwords, but it can be hard to evaluate a user-generated password.

[-]Phil4y10

Attackers aren't given infinite attempts, and even if they are, God doesn't give them infinite time. So what you really want is to minimize the probability that the attacker guesses your password before giving up.

Suppose the attacking bot can make 200,000 attempts. By the first scheme, the probability the attacker guesses the password is .95 (plus an infinitesimal). By the first scheme but with a three-character password on a high roll, the probability is 1.00 (with 50 different characters, there are only 125K three-character words, so success in the remaining 199,999 attempts is certain).

By this measure, both passwords are weak, but the second is weaker than the first.

My Blackberry locks attackers out after 10 tries. So I would choose n=10 rather than n=200000. By that measure, the first scheme is roughly p=.950000, and the second is roughly p=.950072.

[-]JBlack4y10

Taking it a bit further, these conditional probabilities depend upon some prior for various types of attacks. The numbers you have are P(guess password | password scheme AND attacker guesses via login attempts). They are both still really bad, but the difference is slightly greater if the password can be attacked by somehow exfiltrating a password hash database and running a billion guesses against that.

[+][comment deleted]4y20

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

36

Entropy isn't sufficient to measure password strength

36

36

A comically bad password selection scheme

Min-entropy