Re: Standard Deviation
- From: Tom Backer Johnsen <backer@xxxxxxxxxxxx>
- Date: Wed, 12 Oct 2005 17:11:08 +0200
Jan Derk wrote:
Tom Backer Johnsen wrote:
Your alternative still contains the large number, being the sum of the squares, which means that it suffers from the very same you are criticising in Nils's solution.
No, that is not correct. It is true that the sum of the squared differences from the mean is there, but the main information is in the most significant digits, rather than being computed as the difference between two very large numbers, i.e. in the least significant digits.
This line:
Std := Std + (d * (sObs - Avr));
sums the square of the standard deviation of all elements in the array. If the standard deviations are large, or the number of array elements is large, or worse, both are large, you get an overflow. Same problem.
In your other post you show an example with a large average and a small standard deviation to prove your point, but the opposite can easily be true too: a small average with a large standard deviation.
I got bored, so I created a statistics unit which has no large summators, thus eliminating the overflow problem. The Mean procedure is similar to what you have shown above. It is a 5 minute hack, so no guarantees.
Sigh. The argument was about the best way to compute, with a *one-pass* algorithm, the SS (the sum of squared deviations from the mean), or the variance (the SS divided by N or N-1, depending on whether you want the sample value or the estimate of the population value), or the standard deviation, which is the square root of the variance.
Now, your example was a tad overly complicated, but it is probably fine as far as I can see without testing it. It is nevertheless a two-pass algorithm (first find the mean, then find the SS or variance in a second pass). That may be OK, but it can be a clear disadvantage in many situations as well.
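For comparison, a two-pass computation along the lines just described might look like the minimal sketch below. It is written in Python rather than Delphi, the function name is made up for illustration, and it is not the unit posted earlier in the thread.

```python
import math

def two_pass_stats(values):
    """Return (mean, ss, variance, std) using two passes over the data."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    # Pass 1: the mean.
    mean = sum(values) / n
    # Pass 2: the sum of squared deviations from the mean (SS).
    ss = sum((x - mean) ** 2 for x in values)
    # N-1 gives the estimate of the population value; divide by n
    # instead for the sample value.
    variance = ss / (n - 1)
    return mean, ss, variance, math.sqrt(variance)
```

The data has to be available twice, which is the disadvantage when the values arrive as a stream or do not fit in memory.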
Now, the standard solution to the one-pass problem is to run through the values once and accumulate (a) the sum of all the values, and (b) the sum of all the squared values. When the loop is finished, throw the two values at the computational formula:
SS := SumXSquared - (SumX * SumX) / N
And go on from there with the computation of the variance and the standard deviation as mentioned above. That was what I objected to. My point was simply that with the "wrong kind of data" (a large mean and a small spread) the two terms of the formula become very large and nearly equal, so the subtraction loses accuracy in the least significant digits, and that is precisely where the useful information is: in a (potentially small) difference between two very large numbers. In even worse cases the first term, SumXSquared, would be inaccurate as well. Recall the simple example I gave in an additional comment. Consequently, using the standard computational formula as the basis for a one-pass algorithm is not very smart.
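The loss of accuracy is easy to reproduce. The following is a small sketch in Python (whose floats are doubles), with made-up data that has a large mean and a small spread; the function name is illustrative. The true SS of these four values is 90, but each term of the formula is around 4e18, where adjacent doubles are hundreds apart, so the subtraction cannot return 90.

```python
def ss_computational(values):
    """One pass over the data, then SS = SumXSquared - SumX*SumX / N."""
    n = 0
    sum_x = 0.0
    sum_x_squared = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x_squared += x * x
    return sum_x_squared - (sum_x * sum_x) / n

# Mean is 1e9 + 10; the deviations are -6, -3, 3, 6, so the true SS is 90.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(ss_computational(data))  # not 90: the small difference is lost in rounding
```

With type single the same effect would appear at far smaller magnitudes, since the mantissa carries only about seven decimal digits.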
You may of course expand the useful range for the algorithm by using type double, but the basic flaw is there in any case.
The statement you objected to accumulates the value of the SS directly, with no further adjustment needed afterwards, and the result is accurate to a reasonable number of digits, depending on whether type single or double is used. In most cases type double would be overkill.
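The statement in question appears to be the accumulation step of the updating scheme usually credited to Welford. A minimal sketch of such a loop, in Python rather than Delphi, is given below. The names d, avr and ss loosely mirror d, Avr and Std in the quoted line, but the surrounding loop is an assumption, not the code posted earlier in the thread.

```python
def ss_one_pass(values):
    """Return (mean, SS) in one pass, updating the mean and SS together."""
    n = 0
    avr = 0.0  # running mean
    ss = 0.0   # running sum of squared deviations from the mean
    for x in values:
        n += 1
        d = x - avr          # deviation from the old mean
        avr += d / n         # new mean
        ss += d * (x - avr)  # old deviation times new deviation
    return avr, ss

# Large mean, small spread: deviations are -6, -3, 3, 6, so the true SS is 90.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
mean, ss = ss_one_pass(data)
print(mean, ss)
```

Only deviations from the running mean are ever squared, so the accumulated value stays at the scale of the SS itself and never becomes a difference between two huge sums. For this data the sketch returns an SS of 90.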
Tom