Re: help with statistics library



Duncan Muirhead wrote:
On Wed, 25 Apr 2007 20:03:28 +0000, Richard Weeks wrote:

<snip>

http://members.shaw.ca/bystander/statsource.html


Some random comments

Duncan, thanks for taking the time to look at my project. Your comments are most helpful.

1. Use const double* where possible; this may allow some compilers
to optimise more and more importantly lets the users see when they
can pass precious data to your routines. Similarly use const for
read only data tables (e.g. t_table05).

OK

2. I think it would be better, for example in min_max, to let the
user pass the pointer to min max rather than min_max allocating it
and returning it. This would allow users to avoid mallocs if they
chose, and if they did malloc then the users code could have the
mallocs and frees paired.

I'll take it under advisement.

3. I don't like agm() at all. On my machine (linux pc, using gcc)
agm(10,30) ran forever. The test for convergence is far too demanding.

This is something I hadn't seen. When the code is compiled with Digital Mars, which I use mostly, agm() works fine. When compiled with several others, including a Windows port of gcc, it hangs in the loop. As pointed out by Jens Toerring, the test for equality may need some kind of scaling factor. Back to the drawing board on this one.


4. Why bother writing log_gamma when math.h has lgamma? Similarly,
why root when you could use pow?

Of course, the whole project is a reinvention of the wheel. Writing it was a learning experience for me and never intended to be something that would compete with all the mature stats packages that are out there.

5. In log_gamma the evaluation of the polynomial is a bit grim.
Its better (in both speed and accuracy) to use Horner's rule.
eg a + bx + cx^2 should be evaluated as (c*x+b)*x+a.

I'll look into Horner's rule.


6. I think its handy in a stats library to have three functions
for each supported distribution: the pdf, the cdf and (not sure
if its got a standard name) an inverse of the cdf, ie given a probability p find x so cdf(x)=p. Of course this would
mean quite a bit of extra work...

7. Somewhere you should state the accuracy to be expected. How
accurate, for example are the cdf's?. For the normal cdf
the usual way to calculate this is to use the incomplete
gamma function (see e.g. NRC). Or even easier use erf() from
math.h. For other distributions there are often better ways to
compute the cdf than numerical integration. It's worth searching
around.

I'll check it out.


8. Generally there's too much repeated code. Could midrange
not call min_max? True there is a (small) performance hit,
but is that important? If it is you could make it even smaller
by use of inline.

Good point, I'll fix it.


9. Why the plethora of variance & standard deviation routines?
Why would the user call one and not another?

I guess I got carried away.


10. You use different equality tests in mode and nmode. Could
this not lead to confusion?

Yes, I'll fix it.

11. I suspect most users would want both the slope and the
the y intercept of the "best fit" line, so it would be
most convenient to provide one routine that gave both,
rather than have a routine each.

12. Quite a few of the paired data routines take an x count
and a y count, but couldn't handle them being different.
For example in slope_bf_line there is a loop
for(idx = 0; idx < xn; idx++)
{
num += (xlist[idx] - xmean) * (ylist[idx] - ymean);
den += (xlist[idx] - xmean) * (xlist[idx] - xmean);
}
If the ylist (of size yn) was in fact smaller than the xlist,
this will lead at best to wrong answers. If, as here, the
x and y data must be the same size, it would give the users
less opportunity for error if they only passed one size.

I do check for differences in the size of x,y data sets:

double slope_bf_line(double *xlist, double *ylist, int xn, int yn)
{
int idx;
double num, den, xmean, ymean;

if(xn != yn)
{
fprintf(stderr,"\nerror: data sets unequal\n"
" in slope_bf_line()\n");
return 0;
}
....
}


13. chi_square looks odd to me. Dividing by the mean of the
data means its only applicable to data with non zero mean.
Duncan

Thanks again.
Richard


.



Relevant Pages

  • Re: help with statistics library
    ... Use const double* where possible; ... This would allow users to avoid mallocs if they ... the pdf, the cdf and (not sure ... Why the plethora of variance & standard deviation routines? ...
    (comp.programming)
  • Re: Chi square functions and integrals
    ... > set of numerical routines from Dew Research. ... > products is a probabilities unit which has functions for returning the cdf ... my ProbabilityDistributions program plots and integrates ... > any f, be it a pdf or not, via Simpson's method. ...
    (borland.public.delphi.language.objectpascal)