Re: Perl routine for cluster detection



vincent64@xxxxxxxxx wrote:
Here is some elementary code to detect the presence of a clustering
structure in a 2-dimensional dataset. It's more heuristic than
scientific, so take it with a grain of salt, as even the concept of
cluster is highly fuzzy.

Before going into details, I'd like to ask what
you think what the following part of your program
does:

...
sub seed {
local($x,$y)=@_;
$kmax=1000;

$x=rand($x)-0.5;
$y=rand($y)-0.5;

for ($k=0; $k<$kmax; $k++) {
print "\t$cluster [$1]\n";
$x=$x+rand($1)-0.5;
$y=$y+rand($1)-0.5;
$px[$k]=$x;
$py[$k]=$y;
}
...


Aside from beeing not able to run under 'strict',
what's meant with

$x=$x+rand($1)-0.5;
$y=$y+rand($1)-0.5;

because '$1' is, at this point, not set.

The seed routine creates a cluster of 1000 points, saved in
cluster.txt: each row corresponds to a point; the first column is the
cluster number, and the next two columns are the x and y coordinates.

Don't do that. The convention in this business is.
First comes x, then y, then z. Because your 'cluster number'
is somehow 'a plane' in your problem space, you should make
it that ('z', third column).

The cluster number is automatically incremented each time a new call
to seed is made, resulting in the creation of a new cluster. The
distance routine computes the distance between two points, for 100
points randomly selected in the data set previously created
(cluster.txt). The output is a file dist.txt, with one row per pair
of points, with two fields: the first column is an indicator and is
equal to 1 if both points belong to the same cluster; the second column is
the distance between the two points. This script illustrates that it
is possible to check whether a data set contains one or two clusters
by looking at the distribution of distances: a gap in the
distribution means the presence of distinct clusters. It also suggests that the computational complexity of computing whether a data set contains one of more clusters is well below O(n), possibly O(n0.5), if one uses
sampling techniques.

Whats the point of that? You have, say 10^7 2D-points, then you
select 100 pair-samples from them, compute their distance and
claim you have 'complexity well below O(n), possibly O(n0.5)'?

I don't get that ...

==>your code was: datashaping.com/cluster_pl.txt

I'd recommend to translate the code from Perl3-style
to Perl5, which is not really that difficult, because
the code does basically almost nothing.

Starting point: ==>

use strict;
use warnings;

my $idclust = 0;

dmp_seed([1.0,1.0], 1000, $idclust++, '>cluster.txt');
dmp_seed([25.0,25.0], 1000, $idclust++, '>>cluster.txt');

distance(100, 'cluster.txt', 'dist.txt'); # nsamp read write


sub distance {
my ($nsamp, $fn_clust, $fn_dist) = @_;

open my $fc,'<', $fn_clust or die "no coord in: $!";
my @pc = map [/(\S+)/g], <$fc>;
close $fc;

open my $fd, '>', $fn_dist or die "no dist out: $!";
for (1 .. $nsamp) {
my ($pm, $pn) = ( $pc[int rand @pc], $pc[int rand @pc] );
printf $fd "%d\t%.8f\n", 1-($pm->[2] == $pn->[2]),
sqrt(($pm->[0]-$pn->[0])**2 + ($pm->[1]-$pn->[1])**2)
}
close $fd;
}

sub dmp_seed {
my ($rseed, $nmax, $cluid, $fname_mod) = @_;
my ($x, $y) = map $_+rand(1)-0.5, @$rseed;

open my $fh, $fname_mod or die "no way out: $!";
for(1 .. $nmax) {
printf $fh "%.8f\t%.8f\t%d\n", $x+=rand(1)-0.5, $y+=rand(1)-0.5, $cluid
}
close $fh;
}
<==

Regards

M.
.



Relevant Pages

  • Re: Does Hubble have to die?
    ... distances to the pleiades open star cluster (tight star cluster visible ... Calibrating Stellar Models with the Pleiades: Resolving the Distance ... ZAMS stars in our sample without the additional uncertainties ... roll constraints at times of maximum parallax factor. ...
    (sci.space.shuttle)
  • Maximum distance for VMS clusters
    ... I contacted Digital Networks, http://www.digitalnetworks.net. ... to them about long distance clustering and the concept of continetal ... The official party line from HP on maximum distance between VMS cluster ...
    (comp.os.vms)
  • Off topic - Spot the error
    ... Here the spreadsheet takes the place of the calculator, and allows the student to see the details of what is happening when a cluster analysis is undertaken. ... Euclidean measures of distance are based on the objects’ values on each of k variables under study. ... These researchers were trying to understand the descent of the modern domestic dog from an examination of the jaws of various specimens of dog-related species found in Thailand. ... The measurement numbers from the above table are used in all the tables ...
    (uk.philosophy.humanism)
  • Re: How PROC CLUSTER deals with Distance data sets ?
    ... My question is about Proc CLUSTER. ... centroid is strongly dependant on the distance used. ... click (left pane) ...
    (sci.stat.consult)
  • K-Means Algorithm
    ... "Given a distance function, a set of centres and a single data-point it ... a set of centres and a set of data-points it ... (loop for cluster in clusters collecting (funcall mean-fn cluster))) ... normalizing-point)) ))) points) ...
    (comp.lang.lisp)