Re: Standard Deviation



Jan Derk wrote:
Tom Backer Johnsen wrote:

Your alternative still contains the large number, being the sum of
the squares, which means that it suffers from the very same problem
you are criticising in Nils's solution.

No, that is not correct.  It is true that the sum of the squared
differences from the mean is there, but the main information is in
the most significant digits, rather than being computed as the
difference between two very large numbers, i.e. in the least
significant digits.

This line:

Std := Std + (d * (sObs - Avr));

sums the squared deviations of all elements in the array. If the
deviations are large, or the number of array elements is large, or
worse, both are large, you get an overflow. Same problem.

In your other post you show an example with a large average and a small
standard deviation to prove your point, but the opposite can easily be
true too: a small average with a large standard deviation.

I got bored, so I created a statistics unit which has no large
accumulators, thus eliminating the overflow problem. The Mean procedure
is similar to what you have shown above. It is a 5-minute hack, so no
guarantees.

Sigh. The argument was about the best way to compute the SS (the sum of squared deviations from the mean), OR the variance (the SS divided by N or N-1, depending on whether you want the sample value or the estimate of the population value), OR the standard deviation (the square root of the variance) -- with a *one-pass* algorithm.


Now, your example was a tad overly complicated, but it is probably fine as far as I can see without testing it. It is a two-pass algorithm nevertheless (first find the mean, then in the second pass find the SS or variance). That may be OK, but it may be a clear disadvantage in many situations as well.
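For reference, a two-pass computation of the SS can be sketched as follows. This is a minimal illustration in Python rather than the Pascal used in this thread, and it is not the unit discussed above; the function name and the data are made up.

```python
def two_pass_ss(values):
    """Sum of squared deviations from the mean, computed in two passes."""
    values = list(values)  # the data are read twice, so they must be stored
    # Pass 1: the mean.
    mean = sum(values) / len(values)
    # Pass 2: the squared deviations from that mean.
    return sum((x - mean) ** 2 for x in values)


# Hypothetical data: large mean, small spread.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
print(two_pass_ss(data))
```

The need to keep all the values around (or to read them a second time) is the disadvantage mentioned above.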

Now, the standard solution to the one-pass problem is to run through the values once and accumulate (a) the sum of all the values, and (b) the sum of all the squared values. When the loop is finished, throw the two values at the computational formula:

SS := SumXSquared - (SumX * SumX) / N

And go on from there with the computation of the variance and the standard deviation as mentioned above. That was what I objected to. My point was simply that with the "wrong kind of data" (a large mean with a small spread), squaring SumX loses accuracy in the least significant digits, and that is precisely where the useful information is: in a (potentially small) difference between two very large numbers. In even worse cases, the first term becomes inaccurate as well. Reflect on the simple example I gave in an additional comment. Consequently, using the standard computational formula as the basis for a one-pass algorithm is not very smart.
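To make the cancellation concrete, here is a minimal sketch in Python (not the Pascal used in this thread; the function name and the data are made up for illustration). The four values have a mean of 1e9 + 10, and their true SS is exactly 90.

```python
def ss_computational(values):
    """SS from the computational formula: SumXSquared - SumX*SumX/N."""
    n = 0
    sum_x = 0.0
    sum_x_squared = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x_squared += x * x  # each term is about 1e18 for the data below
    return sum_x_squared - (sum_x * sum_x) / n


# Hypothetical "wrong kind of data": large mean, small spread.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
print(ss_computational(data))  # true SS is 90
```

Python floats are doubles, and near 1e18 adjacent doubles are 128 apart, so the difference of the two large terms cannot come out as 90 here. With the same values shifted down to 4, 7, 13 and 16, the formula gives 90 without trouble.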

You may of course expand the useful range of the algorithm by using type double, but the basic flaw is there in any case.

The statement you objected to accumulates the value of the SS directly, with no further adjustment needed afterwards, and that value is accurate to a reasonable number of digits, depending on the use of type single or double. In most cases type double would be overkill.
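A minimal sketch of such a one-pass update, written in Python rather than Pascal. It assumes the quoted statement sits in a loop where d is the deviation of the new observation from the running mean *before* the mean is updated (the usual Welford-style recurrence); the quoted line alone does not show this, so the surrounding loop and all names other than d, sObs and Avr are assumptions.

```python
import math


def running_ss(values):
    """One pass: returns (n, mean, SS) using a running-mean update."""
    n = 0
    avr = 0.0  # running mean
    ss = 0.0   # running sum of squared deviations from the mean
    for s_obs in values:
        n += 1
        d = s_obs - avr          # deviation from the old mean
        avr += d / n             # update the mean
        ss += d * (s_obs - avr)  # old deviation times new deviation
    return n, avr, ss


# Hypothetical data: large mean, small spread.
n, avr, ss = running_ss([1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0])
variance = ss / (n - 1)  # estimate of the population value
std = math.sqrt(variance)
print(ss, variance, std)
```

Only differences from the running mean are squared, so the accumulated quantity stays on the scale of the SS itself rather than on the scale of the squared raw values.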

Tom