Re: data mining




<analyst41@xxxxxxxxxxx> wrote in message
news:1183295103.157455.15890@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In data mining you are typically dealing with millions of rows of data
and if you are talking about internet browsing data, even 100 million
rows or more.

Let us say that you have 30 attributes (explanatory variables) in each
row plus a response variable (0 = no response, 1 = response).

There are all kinds of analysis one can do on such a data set and I
would like some advice on designing a program to do one of them
("Decision Trees") with Fortran.

We want to split the original data set into two subsets (after which
the analysis can be repeated on each of the two subsets) by splitting
on any one of the thirty attributes. If it is a true-false type
attribute then there is only one way to split on that attribute, but in
other cases there would be more choices. The aim of the split is to
create "pure" subsets so that one subset has more responders (based on
count or the proportion or perhaps other measures) than the other and
the top level attribute to be split on would be the one that makes the
difference as high as possible. There might be some constraints as to
how big or small each subset can be.
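
For the true/false case described above, the choice of split attribute
needs only a few counters per attribute, which can be filled in a single
pass over the rows. The following is a minimal sketch under stated
assumptions, not code from the thread: the toy data, the names x, y, n1,
r1 and min_size, and the use of the absolute difference in response
proportion as the "purity" measure are all illustrative choices.

```fortran
program best_split
  implicit none
  ! Hypothetical toy sizes; a real data set would be read row by row from a file.
  integer, parameter :: nrows = 8, nattr = 3, min_size = 2
  integer :: x(nattr, nrows)   ! x(j, i) = value (0 or 1) of attribute j in row i
  integer :: y(nrows)          ! response: 0 = no response, 1 = response
  integer :: n1(nattr)         ! rows in which attribute j is 1
  integer :: r1(nattr)         ! responders among those rows
  integer :: i, j, rtot, best
  real :: p0, p1, diff, best_diff

  ! Made-up data; each group of three values is one row.
  x = reshape([1,0,1,  1,1,0,  1,0,0,  1,1,1, &
               0,0,1,  0,1,0,  0,0,0,  0,1,1], [nattr, nrows])
  y = [1, 1, 1, 0, 0, 0, 0, 1]

  ! Single pass over the rows: only the counters stay in memory.
  n1 = 0
  r1 = 0
  rtot = 0
  do i = 1, nrows
     rtot = rtot + y(i)
     do j = 1, nattr
        if (x(j, i) == 1) then
           n1(j) = n1(j) + 1
           r1(j) = r1(j) + y(i)
        end if
     end do
  end do

  ! Pick the attribute whose two subsets differ most in response proportion,
  ! skipping splits that leave either subset smaller than min_size.
  best = 0
  best_diff = -1.0
  do j = 1, nattr
     if (n1(j) < min_size .or. nrows - n1(j) < min_size) cycle
     p1 = real(r1(j)) / real(n1(j))
     p0 = real(rtot - r1(j)) / real(nrows - n1(j))
     diff = abs(p1 - p0)
     if (diff > best_diff) then
        best_diff = diff
        best = j
     end if
  end do

  print '(a,i0,a,f6.3)', 'best attribute: ', best, '  difference: ', best_diff
end program best_split
```

On this toy data the first attribute wins: its two subsets respond at
rates of 0.75 and 0.25, while the other two attributes split the
responders evenly. Because the counters do not grow with the number of
rows, the same loop could in principle read records from disk one at a
time instead of holding them in an array.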

If we are talking about a small number of rows then this is a pretty
elementary problem as far as I can see it. There are free and
commercial packages that offer to do this - but if one were to do this
from scratch in Fortran, I would appreciate the group's suggestions as
to how this kind of volume of data can be handled.

My own opinion is that so-called "data mining" is methodologically unsound,
to say nothing of its legality. If you are "harvesting" millions of rows of
data, you might as well get your subsets by using the 29 dimensions of
compatibility. My suggestion for what to do with the data is to throw it out
before a court tells you to.
--
Wade Ward

