Re: data mining
- From: "Wade Ward" <invalid@xxxxxxxxxxxx>
- Date: Sun, 1 Jul 2007 14:16:24 -0400
<analyst41@xxxxxxxxxxx> wrote in message
news:1183295103.157455.15890@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In data mining you are typically dealing with millions of rows of data,
and if you are talking about internet browsing data, even 100 million
rows or more.

Let us say that you have 30 attributes (explanatory variables) in each
row plus a response variable (0 = no response, 1 = response).

There are all kinds of analysis one can do on such a data set and I
would like some advice on designing a program to do one of them
("Decision Trees") with Fortran.

We want to split the original data set into two subsets (after which
the analysis can be repeated on each of the two subsets) by splitting
on any one of the thirty attributes. If it is a true-false type
attribute then there is only one way to split on that attribute, but in
other cases there would be more choices. The aim of the split is to
create "pure" subsets so that one subset has more responders (based on
count or the proportion or perhaps other measures) than the other, and
the top-level attribute to be split on would be the one that makes the
difference as high as possible. There might be some constraints as to
how big or small each subset can be.

If we are talking about a small number of rows then this is a pretty
elementary problem as far as I can see. There are free and
commercial packages that offer to do this, but if one were to do this
from scratch in Fortran, I would appreciate the group's suggestions as
to how this kind of volume of data can be handled.

My own opinion is that so-called "data mining" is methodologically unsound,
to say nothing of its legality. If you are "harvesting" millions of rows of
data, you might as well get your subsets by using the 29 dimensions of
compatibility. My suggestion for what to do with the data is to throw it out
before a court tells you to.
--
Wade Ward
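
For reference, the split described in the quoted question can be sketched
roughly as follows. This is only an illustration under stated assumptions,
not anything proposed in the thread: it is written in Python rather than
Fortran for brevity, the names count_table, best_split and min_size are
hypothetical, multi-valued attributes are split "one value vs. the rest",
and purity is measured as the gap in response proportion between the two
subsets. The point relevant to data volume is that the rows are read once
and only small per-value tallies are kept in memory.

```python
from collections import defaultdict

def count_table(rows, n_attrs):
    """One streaming pass over the data.

    counts[a][v] = [rows seen, responders] for value v of attribute a.
    Each row is (attribute tuple, response) with response 0 or 1.
    """
    counts = [defaultdict(lambda: [0, 0]) for _ in range(n_attrs)]
    for attrs, response in rows:
        for a in range(n_attrs):
            cell = counts[a][attrs[a]]
            cell[0] += 1
            cell[1] += response
    return counts

def best_split(counts, min_size=1):
    """Return (gap, attribute index, value) for the best split, or None.

    Assumption: each candidate split is "attribute == value" vs. the rest,
    scored by the absolute difference in response proportion.
    """
    min_size = max(min_size, 1)  # both subsets must be non-empty
    best = None
    for a, table in enumerate(counts):
        total_n = sum(n for n, _ in table.values())
        total_r = sum(r for _, r in table.values())
        for v, (n, r) in table.items():
            rest_n, rest_r = total_n - n, total_r - r
            if n < min_size or rest_n < min_size:
                continue  # violates the subset-size constraint
            gap = abs(r / n - rest_r / rest_n)
            if best is None or gap > best[0]:
                best = (gap, a, v)
    return best

# Toy data: attribute 0 is true/false, attribute 1 is categorical.
rows = [((True, "a"), 1), ((True, "b"), 1), ((False, "a"), 0),
        ((False, "b"), 0), ((True, "a"), 1), ((False, "b"), 1)]
split = best_split(count_table(rows, 2))
```

Because rows can be any iterable, the same pass could read from a file
too large to hold in memory; the tallies grow with the number of distinct
attribute values, not with the number of rows.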