
Friday, January 18, 2008

Notes on standard deviation

I picked up these facts while working on a derivation for work, mainly from a discussion of standard deviation on Mark Chu-Carroll's blog. When most of us learn basic statistics, we are taught the 68/95/99.7% rule; that is, for a Gaussian (normal) distribution, ~68% of the data lie within 1 standard deviation of the mean (μ ± 1σ), ~95% within μ ± 2σ, and ~99.7% within μ ± 3σ. However, out in the "real" world, we rarely come across perfect Gaussian distributions or even distributions we can approximate as being Gaussian. Thus, we can't apply this rule to make sensible statements about how data cluster around the mean. Fortunately, there are two theorems (given below without proof) that do allow us to make such statements, regardless of the distribution [1].
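As a quick sanity check on the 68/95/99.7% figures quoted above, here is a minimal sketch that computes them from the Gaussian error function. The helper name `gaussian_within` is my own, chosen only for illustration; the calculation relies on the standard identity that the fraction of a normal distribution within k standard deviations of the mean is erf(k/√2).

```python
import math


def gaussian_within(k):
    """Fraction of a Gaussian distribution lying within k standard deviations of the mean."""
    # For a normal distribution, P(|X - mu| <= k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within mu +/- {k} sigma: {gaussian_within(k):.4f}")
```

These values hold only for a Gaussian; the theorems that follow drop that assumption.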

Chebyshev's inequality tells us that for any distribution, no more than 1/λ² of the data are more than λ standard deviations away from the mean; equivalently, at least (1 - 1/λ²) of the data are within λ standard deviations of the mean. So, regardless of the distribution, at least 75% of the data are within μ ± 2σ, and at least ~88.9% are within μ ± 3σ.
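To make the Chebyshev bound concrete, here is a small sketch that evaluates (1 - 1/λ²) and compares it with the observed fraction in a deliberately non-Gaussian sample. The function names, the exponential test distribution, the sample size, and the seed are all my own illustrative choices, not anything prescribed by the theorem.

```python
import random
import statistics


def chebyshev_within(lam):
    """Chebyshev lower bound on the fraction of data within lam standard deviations of the mean."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    # The bound is vacuous (zero) for lam <= 1.
    return max(0.0, 1.0 - 1.0 / lam**2)


def fraction_within(data, lam):
    """Observed fraction of data within lam population standard deviations of the sample mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(abs(x - mu) <= lam * sigma for x in data) / len(data)


# Illustrative skewed sample: exponential with rate 1, fixed seed for repeatability.
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(10_000)]

for lam in (2, 3):
    bound = chebyshev_within(lam)
    observed = fraction_within(sample, lam)
    print(f"lambda={lam}: bound >= {bound:.4f}, observed = {observed:.4f}")
```

The observed fraction will typically sit well above the bound, since Chebyshev's inequality is a worst-case guarantee across all distributions with finite variance.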

Further, if the distribution is unimodal, we can refine our statements regarding the distribution of data around the mean, yielding the Vysochanskiï-Petunin inequality; it states that for all λ > √(8/3) ≈ 1.633, no more than 4/(9λ²) of the data are more than λ standard deviations away from the mean. So, for any unimodal distribution, at least ~88.9% of the data are within μ ± 2σ, and at least ~95.1% are within μ ± 3σ. Note that the λ > √(8/3) limit is important; without it, the distribution X must be symmetric about the mean (i.e., X(μ + x₀) = X(μ - x₀)) in order for the (4/9) factor to be correct.
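The same arithmetic for the Vysochanskiï-Petunin bound can be sketched as follows. The function names are hypothetical, and the choice to raise an error when λ ≤ √(8/3) is my own design decision for the sketch: it simply refuses to report the 4/(9λ²) figure outside the range where the statement above guarantees it.

```python
import math

# The 4/(9 * lam**2) bound is stated only for lam strictly greater than sqrt(8/3).
VP_THRESHOLD = math.sqrt(8 / 3)


def vp_within(lam):
    """Vysochanskii-Petunin lower bound on the fraction within lam sigma (unimodal distributions)."""
    if lam <= VP_THRESHOLD:
        raise ValueError(f"bound is only stated for lam > sqrt(8/3) ~ {VP_THRESHOLD:.3f}")
    return 1.0 - 4.0 / (9.0 * lam**2)


def chebyshev_within(lam):
    """Chebyshev lower bound, included here only for side-by-side comparison."""
    return 1.0 - 1.0 / lam**2


for lam in (2, 3):
    print(
        f"lambda={lam}: Chebyshev >= {chebyshev_within(lam):.4f}, "
        f"Vysochanskii-Petunin >= {vp_within(lam):.4f}"
    )
```

Assuming unimodality shrinks the permitted tail fraction by a factor of 4/9, which is why the guaranteed coverage at μ ± 2σ rises from 75% to about 88.9%.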

Other statements about the distribution of data about a mean or mode do exist (e.g., the Camp-Meidell inequality or the more general Gauss-Winckler inequality). However, these have more complicated assumptions about the data distribution, and won't be discussed here (for now).


[1] Actually, that's not entirely true. Certain distributions, such as the Cauchy distribution, have either an undefined or an infinite variance; in these cases, none of the theorems discussed above apply, and we can't say anything about the distribution of data around the mean. For the purposes of this post, then, "distribution" should be read as "distribution with a finite variance".
