This post will attempt to formalise the intuition that "it's hard to figure out the turning point of a logistic curve, at least until after that turning point". A short visual "proof" can also be found here.

Logistic curves

The logistic curves look like this:

Logistic curves can be specified by three parameters, , , and . Their equation is then:

Note that this is different from that in this article. The turning point of this curve is at (where it takes the value of ) while its supremum is ; it tends to this value as . Take the limit as as being the exponentials:
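As a minimal sketch of the curves described above, the code below assumes the standard three-parameter logistic form. The parameter names `height` (the supremum), `midpoint` (the turning point) and `rate` (the steepness) are hypothetical labels and not the post's own notation. It also illustrates a standard fact that matters later: far below its turning point, a logistic curve is very close to an exponential.

```python
import math

def logistic(x, height, midpoint, rate):
    """Standard logistic curve: equals height/2 at x == midpoint, tends to height as x grows."""
    return height / (1.0 + math.exp(-rate * (x - midpoint)))

def exponential_tail(x, height, midpoint, rate):
    """Exponential that the logistic curve approaches for x far below the midpoint."""
    return height * math.exp(rate * (x - midpoint))

if __name__ == "__main__":
    # Compare the two functions at a few points (illustrative parameter values).
    for x in (-6.0, -3.0, 0.0, 3.0):
        print(x, logistic(x, 2.0, 0.0, 1.0), exponential_tail(x, 2.0, 0.0, 1.0))
```

Well below the turning point the two columns nearly coincide; they only separate as `x` approaches the midpoint, which is the intuition the rest of the post formalises.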

Figuring out the right curve

We'll imagine a simple Bayesian setup. An analyst of logistic curves is seeing data from one distribution, and has two hypotheses about it: , for values , , and , and with values , , and . We'll designate by and by .

Now, the true distribution is , but the analyst doesn't know that. The question we're asking is thus:

  • Starting from an equal prior on and , how many observations, and of what kind, will the analyst need to establish that is the true underlying distribution?

Noisy Sampling

If the analyst can sample noiselessly from the curve, then three samples should generally suffice to fully establish , and one sample should generally suffice to distinguish from . So we'll consider the (realistic) situation where there is noise in the samples.

So assume the analyst samples points, at . In return, it gets values, ; these are sampled independently from . This is a normal distribution with mean and standard deviation .

The analyst is assumed to know the vector , and indeed everything about this setup, with one exception: whether the means of these normal distributions are or .

Let be the analyst's probability distribution. Their prior gives equal weight to both hypotheses: . Let be the analyst observing after sampling from ; their posterior is then .
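The update described above can be sketched as follows. This assumes independent Gaussian noise with known standard deviations and an equal prior on the two hypotheses. The names `posterior_of_wrong_hypothesis`, `means_true` and `means_wrong` are hypothetical labels for the two candidate curves evaluated at the sample points; "true" and "wrong" describe our perspective, not the analyst's.

```python
import math

def log_likelihood(ys, means, sigmas):
    """Log-density of independent normal observations ys with the given means and std devs."""
    total = 0.0
    for y, m, s in zip(ys, means, sigmas):
        total += -0.5 * ((y - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))
    return total

def posterior_of_wrong_hypothesis(ys, means_true, means_wrong, sigmas):
    """Posterior probability of the wrong hypothesis, starting from a 50/50 prior."""
    log_odds = log_likelihood(ys, means_wrong, sigmas) - log_likelihood(ys, means_true, sigmas)
    # posterior = L_wrong / (L_true + L_wrong), computed in a numerically stable way
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Working with log-likelihoods avoids underflow when many samples are combined, since the product of many small densities quickly drops below floating-point range.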

Note that, from our perspective, is a random variable whose distribution we know. Say that:

  • establishes the difference between and if the expectation of is less than .

We could choose other criteria, and this is a relatively loose one. It only assumes three bits of information in favour of over . Note that since , we can get probability bounds on as well, from this result; for instance:

  • If , then with probability at least , .

So, for instance, our criterion above ensures that with probability at least , . Conversely, since , probability bounds on translate into expectation bounds, making the two approaches loosely equivalent. We'll use expectation bounds, as they are more natural for this random variable.
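Both directions of this loose equivalence are instances of standard inequalities. Writing $Q$ for the posterior probability of the wrong hypothesis (a label assumed here for illustration) and using only $0 \le Q \le 1$, Markov's inequality gives the first line and a simple split of the expectation gives the second, for any threshold $a > 0$:

```latex
\begin{align*}
P(Q \ge a) &\le \frac{\mathbb{E}[Q]}{a}, \\
\mathbb{E}[Q] &\le a \cdot P(Q < a) + 1 \cdot P(Q \ge a) \le a + P(Q \ge a).
\end{align*}
```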

Bounding results

Our first result, proven in later sections, is a lower bound on the expectation of :

Here is the error function and is the absolute difference between and . We can then get the slightly looser but more easily computable bound:
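A sketch of how the two bounds can be evaluated numerically. The formulas used here are a reconstruction from the steps in the Proof section (half the product of the overlaps of paired normal densities, then the linearised error function), so treat them as an assumption rather than as the post's exact equations. `deltas` and `sigmas` are hypothetical names for the per-sample gaps between the two curves and the per-sample noise levels.

```python
import math

def erf_bound(deltas, sigmas):
    """Lower bound of the form 0.5 * prod(1 - erf(delta_i / (2*sqrt(2)*sigma_i)))."""
    product = 1.0
    for d, s in zip(deltas, sigmas):
        product *= 1.0 - math.erf(d / (2.0 * math.sqrt(2.0) * s))
    return 0.5 * product

def linear_bound(deltas, sigmas):
    """Looser bound 0.5 * (1 - sum(delta_i / (sqrt(2*pi)*sigma_i))).

    The result can be negative, in which case the bound is vacuous.
    """
    total = sum(d / (math.sqrt(2.0 * math.pi) * s) for d, s in zip(deltas, sigmas))
    return 0.5 * (1.0 - total)
```

With no separation between the curves (`deltas` all zero) both functions return 0.5, meaning the analyst cannot be expected to move off the prior; larger gaps or smaller noise push both bounds down.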

How to sample

Sampling very large positive or negative values

Note that:

Hence we can bound the via:

Let ; note this is an increasing function, exponential for very negative .

Assume we sample different values below a very negative ; then if is the minimum of all the for , the contribution of these points to the expectation bound is at most and at least:

This gives our result for very negative values:

  • If noise is irreducible below , then sampling below a very negative will have very little impact on the analyst's posterior. To get a better result, increasing the (exponential effect) is generally more powerful than decreasing (inverse linear effect), and much more powerful than getting more samples (linear effect).

The behaviour for large positive is also clear: unless , and must have different asymptotes. So as long as there is an upper bound on the noise, sampling the curve at large values will cause the expectation of to converge to . For large , this is essentially trying to distinguish from , so each extra sample applies a multiplicative factor to the expected value of . So, for large samples, the probability of the wrong function converges geometrically to zero in the number of samples.

Finding (any) turning point

So, distinguishing and for very low samples is very hard, but distinguishing them for very high samples is generally not very useful. But enough about asymptotic behaviour. The question is: what happens in between, closer to the turning points and of and ?

We can make some scaling and translation choices to simplify , setting and . So the turning point is at ( value ) and the supremum is :

Assume now that the noise is a constant . We want to have a different turning point, so that we can see how easy it is to identify this turning point. Let's choose the worst possible scenario: is an exponential function with no turning point:

So, how can the analyst sample so that they have the greatest possible chance of distinguishing between a true function with a turning point at , and a false function with no turning point at all?

We have two free variables: the and of , and we typically want to see how well we can do when sampling below a given . For constant , the elements of the bound are given by:

Define as this function, without the term. We'll now consider , ie we are sampling at any point before the turning point. Then some experimentation allows us to minimize for negative values, by setting and ; given these values, is bounded above by :

Consequently we can use equation (2) to get a bound:

To establish the difference between and , we need this below . Consequently, we need , or

So if the noise is , ie of the value at the turning point, a single data point might suffice. But if the noise is of the value at the turning point, then at least seven samples are needed.
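The rearrangement behind this kind of sample count can be sketched as follows. It assumes the linearised bound takes the form 0.5 * (1 - n * m / (sqrt(2*pi) * sigma)), with constant noise `sigma` and with the gap between the two curves at most `m` at every sample point; this is a reconstruction from the Proof section, not the post's verbatim equation. The arguments `threshold`, `sigma` and `max_gap` are hypothetical inputs, and no values from the post are assumed.

```python
import math

def min_samples_needed(threshold, sigma, max_gap):
    """Smallest integer n with 0.5 * (1 - n * max_gap / (sqrt(2*pi) * sigma)) < threshold.

    This is only a necessary condition: the expression is a lower bound on the
    expected posterior of the wrong hypothesis, so with fewer samples that
    expectation cannot be below the threshold.
    """
    if not (0.0 < threshold < 0.5) or sigma <= 0.0 or max_gap <= 0.0:
        raise ValueError("need 0 < threshold < 0.5, sigma > 0 and max_gap > 0")
    critical = (1.0 - 2.0 * threshold) * math.sqrt(2.0 * math.pi) * sigma / max_gap
    return math.floor(critical) + 1  # strict inequality: n must exceed `critical`
```

The count scales linearly with the noise level and inversely with the largest gap between the curves, which is why larger relative noise quickly demands more samples.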

Anyway, that's all the way to the turning point; what about if is chosen so that the value is (two thirds of the value at the turning point) or (a half of the value at the turning point)? To get these, we need and , respectively. We'll also look at past the turning point, and .

Optimising and for all five situations gives:

  • For , or .
  • For , or .
  • For , or .
  • For , or .
  • For , or .

But equation (2) gives poor bounds for low . Using equation (1) instead, for ( of turning point -value) and ( of turning point -value), gives the number of samples needed as:

Other difficulties

The bounds above are only good if the values are sampled independently and close to the peak of the function. If the values are not independent - as values sampled close to each other tend not to be - then more must be sampled, and the same goes if the values are sampled away from the peaks.

The other issue is that, here, we've first optimised and for minimal peak of , then assumed the best were sampled. We need to consider the opposite situations, too: given the sampled , optimise and . So, even if samples are enough to distinguish from this specific , there are other exponential functions that would be harder to distinguish from .

Proof

This section will prove the bounds in equations (1) and (2).

By Bayes' rule:

since the prior probabilities are equal. Since the analyst knows the true variances, and similarly for : we can replace the analyst's probabilities with the true probabilities. So, contracting as and as , we get:

To get the true expectation of this , we need to integrate over the possible values of , weighted by the true probability of this happening:

Note that and are both (strictly) positive, and that is half the harmonic mean of the two.

The harmonic mean of any number of positive elements is bounded below by the minimum value of its arguments. Hence:

Now, since the noise is independent, where and . For positive elements, the minimum of two products is greater than or equal to the product of minimums, so

The expressions can be expressed analytically. If is the probability density function of , the normal distribution with mean and variance , then

So the two curves are normal curves with the same variance and means and . Assume, without loss of generality, that . Then the two functions will be equal at the midpoint , and for , is higher, while for , is higher.

Thus

If is the distance between the two peaks, this becomes:

Since the integral of is , for the error function, we can bound the expected probability by:
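The overlap integral behind this step can be checked numerically. The sketch below assumes two normal densities with a common standard deviation `sigma` whose means are `delta` apart, and compares a midpoint Riemann sum of their pointwise minimum with the closed form $1 - \operatorname{erf}\left(\delta / (2\sqrt{2}\,\sigma)\right)$. That closed form is a reconstruction of the expression used here, not a quotation, and the function names are hypothetical.

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def overlap_numeric(delta, sigma, steps=200_000):
    """Midpoint Riemann sum of min(pdf1, pdf2) for normal densities with means 0 and delta."""
    lo, hi = -10.0 * sigma, delta + 10.0 * sigma  # tails beyond 10 sigma are negligible
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * width
        total += min(normal_pdf(y, 0.0, sigma), normal_pdf(y, delta, sigma))
    return total * width

def overlap_closed_form(delta, sigma):
    """Closed-form overlap of two equal-variance normal densities whose means are delta apart."""
    return 1.0 - math.erf(delta / (2.0 * math.sqrt(2.0) * sigma))
```

The two agree because the minimum switches from one density to the other exactly at the midpoint between the means, so the integral is twice a single normal tail.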

For positive values, the error function is concave, and it has derivative $2/\sqrt{\pi}$ at the origin, so

Consequently

Using the fact that for positive, , we get a final bound:
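Putting the steps of this section together, the chain of inequalities can be written out as below. The notation is assumed for illustration, since it is not the post's own: $Q$ is the posterior of the wrong hypothesis, $p = \prod_i p_i$ and $q = \prod_i q_i$ are the densities of the observations under the true and the wrong hypothesis, $\sigma_i$ is the noise level and $\delta_i$ the distance between the two means at the $i$-th sample point. The last two steps require each $\delta_i / (\sqrt{2\pi}\,\sigma_i)$ to lie in $[0, 1]$.

```latex
\begin{align*}
\mathbb{E}[Q]
  &= \int \frac{p(y)\,q(y)}{p(y)+q(y)}\,dy
   \;\ge\; \frac{1}{2}\int \min\bigl(p(y),q(y)\bigr)\,dy \\
  &\ge \frac{1}{2}\prod_{i=1}^{n}\int \min\bigl(p_i(y_i),q_i(y_i)\bigr)\,dy_i
   \;=\; \frac{1}{2}\prod_{i=1}^{n}\left(1-\operatorname{erf}\!\left(\frac{\delta_i}{2\sqrt{2}\,\sigma_i}\right)\right) \\
  &\ge \frac{1}{2}\prod_{i=1}^{n}\left(1-\frac{\delta_i}{\sqrt{2\pi}\,\sigma_i}\right)
   \;\ge\; \frac{1}{2}\left(1-\sum_{i=1}^{n}\frac{\delta_i}{\sqrt{2\pi}\,\sigma_i}\right).
\end{align*}
```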
