Computational complexity of cluster detection



Here is some elementary code to detect the presence of a clustering
structure in a 2-dimensional dataset. It's more heuristic than
scientific, so take it with a grain of salt, as even the concept of
cluster is highly fuzzy.

The seed routine creates a cluster of 1000 points, saved in
cluster.txt: each row corresponds to a point; the first column is the
cluster number, and the next two columns are the x and y coordinates.
The cluster number is automatically incremented each time a new call
to seed is made, resulting in the creation of a new cluster. The
distance routine computes the distance between two points, for 100
points randomly selected in the data set previously created
(cluster.txt). The output is a file dist.txt, with one row per pair of
points, with two fields: the first column is an indicator and is equal
to 1 if both points belong to the same cluster; the second column is
the distance between the two points. This script illustrates that it
is possible to check whether a data set contains one or two clusters
by looking at the distribution of distances: a gap in the distribution
means the presence of distinct clusters. It also suggests that the
computational complexity of computing whether a data set contains one
of more clusters is well below O(n), possibly O(n0.5), if one uses
sampling techniques.

Source code: http://datashaping.com/regress.shtml

.



Relevant Pages

  • Re: Perl routine for cluster detection
    ... each row corresponds to a point; the first column is the ... cluster number, and the next two columns are the x and y coordinates. ... distance routine computes the distance between two points, ... is possible to check whether a data set contains one or two clusters ...
    (comp.lang.perl.misc)
  • Computational complexity of cluster detection
    ... Here is some elementary code to detect the presence of a clustering ... cluster is highly fuzzy. ... distance routine computes the distance between two points, ... the first column is an indicator and is equal ...
    (sci.math)
  • Perl routine for cluster detection
    ... Here is some elementary code to detect the presence of a clustering ... cluster is highly fuzzy. ... each row corresponds to a point; the first column is the ... distance routine computes the distance between two points, ...
    (comp.lang.perl.misc)
  • Angoss SDK sample - Interpreting the clusters?
    ... cluster model shows up in the SegView control. ... The first column makes sense - it talks abt the distinct clusters in the ... But wat are the other pie charts? ...
    (microsoft.public.sqlserver.datamining)
  • Re: Does Hubble have to die?
    ... distances to the pleiades open star cluster (tight star cluster visible ... Calibrating Stellar Models with the Pleiades: Resolving the Distance ... ZAMS stars in our sample without the additional uncertainties ... roll constraints at times of maximum parallax factor. ...
    (sci.space.shuttle)